brantfaircloth/convert-bcl-to-fastq.rst

## convert-bcl-to-fastq.rst

      
          
Install dependencies and Casava

The following assumes you are converting BCL files containing PE100 reads with a 10 nt index read.
You can allow Casava to demultiplex for you or do it on your own, later.  You can adjust values below
if you are doing something different (e.g. shorter reads, longer indexes) but be careful.

You need a pretty beefy machine.  Illumina recommends something with multiple cores and 48 GB RAM, running CentOS 5.
CentOS 6 also works just fine.  See their recommendations here:
http://support.illumina.com/sequencing/sequencing_software/casava/computing_requirements.ilmn

You need compilers, etc. installed:
yum groupinstall "Development Tools"


Casava 1.8.2 (or whichever version) has many dependencies.  You can meet them all pretty easily on
CentOS 5/6 using yum install packagename.  You want the x86_64 versions:
GNU make (3.81 recommended)
Perl (>= 5.8)
Python (>=2.3 and <=2.6)
PyXML
gnuplot (>= 3.7, 4.0 recommended)
ImageMagick (>= 5.4.7)
ghostscript
libxslt
libxslt-devel
libxml2
libxml2-devel
libxml2-python
ncurses
ncurses-devel
libtiff
libtiff-devel
bzip2
bzip2-devel
zlib
zlib-devel

Perl modules:
perl-XML-Dumper
perl-XML-Grove
perl-XML-LibXML
perl-XML-LibXML-Common
perl-XML-NamespaceSupport
perl-XML-Parser
perl-XML-SAX
perl-XML-Simple
perl-XML-Twig
perldoc
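
Before building, it can help to confirm that the main command-line tools from the lists above are actually on your PATH.  The following is a minimal sketch, not part of Casava: missing_tools is a hypothetical helper, and the command names in the example call (convert for ImageMagick, gs for ghostscript, xsltproc for libxslt) are my assumption about what those packages install.  It does not check versions, libraries, or Perl modules.

```shell
# missing_tools NAME...
# Print each named command that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (command names are assumptions, adjust to your system):
#   missing_tools make perl python gnuplot convert gs xsltproc
```

If the call prints nothing, every listed command was found.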


Get Casava 1.8.2 (or whichever version) from Illumina.

Build the software according to the installation documents.  I installed in my $HOME,
after installing most dependencies as root (except for PyXML, which lives in $HOME so
as not to pollute the system site-packages):
./configure --prefix=$HOME
make
make install


Make sure the install location is in your $PATH
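
One way to do that is sketched below.  It assumes the build used --prefix=$HOME as above, so that the executables ended up in $HOME/bin (an assumption based on the usual prefix layout, so check where configureBclToFastq.pl actually landed).  prepend_path is a hypothetical helper name.

```shell
# prepend_path DIR
# Put DIR at the front of PATH unless it is already listed somewhere in PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, leave PATH alone
        *) PATH="$1:$PATH"; export PATH ;;    # otherwise prepend it
    esac
}

# Example call (assumes executables were installed to $HOME/bin):
#   prepend_path "$HOME/bin"
```

Putting the function and the call in your shell startup file makes the change persistent.  Afterwards, command -v configureBclToFastq.pl should print the installed location.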


Prepare the data


If you have an entire flowcell of data from the HiSeq, you may want to pare this down to a single lane.
You need the following directory structure - I'll use L008 as an example below, but it could be any lane.
An asterisk below is a wildcard, representing all files of that type.  You want to copy the files below from
the entire flowcell into a new directory that represents your lane.  Several of the XML files are not
split by lane, so just copy the whole file.  I'm working on a python script that will do this automatically.
Note:  you should replace L008 and the 8 in s_8_*_pos.txt below with the lane containing your data:
- Date_InstrumentNumber_run/
    - RunInfo.xml
    - runParameters.xml
    Data/
        Intensities/
            - config.xml
            - RTAConfiguration.xml
            - s_8_*_pos.txt
            L008/*
            BaseCalls/
                - config.xml
                - SampleSheet.csv
                L008/*
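
Until that script exists, the copy can be sketched in plain shell.  This is only an illustration of the layout above, not a tested tool: copy_lane is a hypothetical helper, it assumes lane numbers map to directory names as 8 -> L008, and it copies SampleSheet.csv only if one is already present in BaseCalls.

```shell
# copy_lane SRC_RUN DEST_RUN LANE
# Copy the run-level files plus one lane of data from SRC_RUN into DEST_RUN.
copy_lane() {
    src=$1; dest=$2; lane=$3
    lane_dir=$(printf 'L%03d' "$lane")        # e.g. 8 -> L008
    mkdir -p "$dest/Data/Intensities/BaseCalls" || return 1
    cp "$src/RunInfo.xml" "$src/runParameters.xml" "$dest/" || return 1
    cp "$src/Data/Intensities/config.xml" \
       "$src/Data/Intensities/RTAConfiguration.xml" \
       "$src"/Data/Intensities/s_"$lane"_*_pos.txt \
       "$dest/Data/Intensities/" || return 1
    cp -r "$src/Data/Intensities/$lane_dir" "$dest/Data/Intensities/" || return 1
    cp "$src/Data/Intensities/BaseCalls/config.xml" \
       "$dest/Data/Intensities/BaseCalls/" || return 1
    # The sample sheet is optional here; you may create it in DEST_RUN later.
    if [ -f "$src/Data/Intensities/BaseCalls/SampleSheet.csv" ]; then
        cp "$src/Data/Intensities/BaseCalls/SampleSheet.csv" \
           "$dest/Data/Intensities/BaseCalls/" || return 1
    fi
    cp -r "$src/Data/Intensities/BaseCalls/$lane_dir" \
       "$dest/Data/Intensities/BaseCalls/" || return 1
}

# Example call (both paths are placeholders):
#   copy_lane /path/to/Date_InstrumentNumber_run /path/to/lane8_run 8
```

A single lane can still be large, so make sure the destination has enough free space first.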


Set up your SampleSheet.csv properly.  If it is not configured correctly, the program may run but produce
no output, which is confusing.  The columns are:
FCID - the flowcell ID. Must place value here.
Lane - the lane id.  Must place value here.
SampleID - the sample ID.  Must place value here.
SampleRef - the reference genome.  Can be empty.
Index - the index sequence.  Can be empty.
Description - description of what you're doing. Can be empty.
Control - Y or N indicating control lane
Recipe - the recipe name.  Must place value here.
Operator - your name/initials.  Must place value here.
SampleProject - the name of the "project".  Becomes directory holding your files. Must place value here.
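
Since a bad sample sheet can fail silently, a quick check before configuring may save time.  The sketch below is my own, not something Casava provides: check_sample_sheet is a hypothetical helper that only verifies each data row has 10 comma-separated fields and that the columns marked "Must place value here" are non-empty.  It does not handle quoted commas or Windows line endings, and it does not validate the values themselves.

```shell
# check_sample_sheet FILE
# Exit status 0 if every data row has 10 fields and the required ones are filled in.
check_sample_sheet() {
    awk -F',' '
        # Required columns: FCID, Lane, SampleID, Recipe, Operator, SampleProject
        BEGIN { bad = 0; n = split("1 2 3 8 9 10", req, " ") }
        NR == 1 || $0 == "" { next }        # skip the header row and blank lines
        NF != 10 {
            printf "line %d: expected 10 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 1; i <= n; i++)
                if ($(req[i]) == "") {
                    printf "line %d: field %d is empty\n", NR, req[i]
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1"
}

# Example call:
#   check_sample_sheet SampleSheet.csv && echo "sample sheet looks OK"
```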


Sample sheet to return data and indexes, only - no demultiplexing, just reads with indexes (assuming you sequenced indexes):
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
D109LACXX,8,not_demultiplexed,,,Test bcl conversion,N,D109LACXX,BCF,testbclconv


Sample sheet to return demultiplexed data, according to values within.  Note that my barcodes are 10 nt here:
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
D109LACXX,8,my_sample_1,,AACCGAGTTA,Test demultiplexing,N,D109LACXX,BCF,testdmux
D109LACXX,8,my_sample_2,,AATACTTCCG,Test demultiplexing,N,D109LACXX,BCF,testdmux
D109LACXX,8,my_sample_3,,AACAACAACC,Test demultiplexing,N,D109LACXX,BCF,testdmux


Before continuing, work out the paths you will need, which depend on what you want to do and where you're storing the resulting data: the BaseCalls directory of the run folder, the sample sheet, and the output directory.


Generate the Makefile


Standard Demultiplexing


For standard demultiplexing with no error correction, enter your TruSeq indexes in the sample sheet, then run:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv


Standard Demultiplexing and correcting 1 index error

The standard TruSeq indexes allow you to correct one substitution error within their sequence.

For standard demultiplexing with correction of 1 error (substitutions), enter your TruSeq indexes in the sample sheet
and explicitly pass --mismatches:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --mismatches 1


Demultiplexing Long (>9 nt) Indexes


For demultiplexing longer indexes (> 9 nt) with no error correction, you need to explicitly pass --use-bases-mask and have the entire index sequence in your SampleSheet.csv, or you'll get an error:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y*,I*,Y*


Demultiplexing Long (>9 nt) Indexes and correcting 1 index error


For demultiplexing longer indexes (nt > 9) with 1 error correction (substitutions), you need to explicitly pass --use-bases-mask and --mismatches and have the entire
index sequence in your SampleSheet.csv, or you'll get an error:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y*,I*,Y* --mismatches 1
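
As I understand the mask syntax, each comma-separated field of --use-bases-mask describes one read: Y means use the cycles as sequence, I means treat them as an index, a number gives the cycle count, and * means all cycles of that read.  If you prefer explicit lengths over the wildcard form, the string can be assembled as in this sketch.  bases_mask is a hypothetical helper, and it assumes a paired-end run with a single index read between the two reads.

```shell
# bases_mask READ1_LEN INDEX_LEN READ2_LEN INDEX_CODE
# INDEX_CODE is I (demultiplex on the index) or Y (write the index out as a read).
bases_mask() {
    printf 'Y%d,%s%d,Y%d\n' "$1" "$4" "$2" "$3"
}

# Example call for PE100 with a 10 nt index used for demultiplexing:
#   configureBclToFastq.pl ... --use-bases-mask "$(bases_mask 100 10 100 I)"
```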


No Demultiplexing, just output reads and indices


If you want to process the data to fastq and demultiplex the data using an external method, you need to input a
sample sheet with no demultiplexing requested (see above).  Then, assuming you have 10 nucleotide indexes, run:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y*,Y10,Y*


This will output R1 (first 100 bp read), R2 (index read), and R3 (second 100 bp read) files rather than the
"normal" R1 and R2 files that contain an index sequence.  You will need to subsequently manipulate the files
to prepare them for your downstream demultiplexing code.  Thanks go to the excellent Illumina Tech Support
staff for the solution above (and below).

If the wildcard mask causes problems, you may need to state the R1 and R3 read lengths explicitly.  So, if you did a PE100
run, then:
configureBclToFastq.pl \
    --input-dir /your/path/to/run_folder/Data/Intensities/BaseCalls \
    --output-dir /where/you/want/the/output \
    --sample-sheet /my/path/to/run_folder/Data/Intensities/BaseCalls/SampleSheet.csv \
    --use-bases-mask Y100,Y10,Y100
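
What "manipulate the files" means depends on your downstream demultiplexing code.  As one illustration only, the sketch below appends the index sequence from the index-read file to the header of the matching record in a read file.  The "#INDEX" suffix is a made-up convention, tag_with_index is a hypothetical helper, and it assumes both inputs are uncompressed 4-line FASTQ with records in the same order (decompress first, e.g. with gzip -dc, if your output is gzipped).

```shell
# tag_with_index READS.fastq INDEX.fastq
# Print READS with "#<index sequence>" appended to each header line.
tag_with_index() {
    awk -v idx="$2" '
        NR % 4 == 1 {
            # Read the matching 4-line index record; keep only its sequence line.
            getline hdr < idx; getline seq < idx; getline plus < idx; getline qual < idx
            print $0 "#" seq
            next
        }
        { print }
    ' "$1"
}

# Example call (file names are placeholders):
#   tag_with_index sample_R1.fastq sample_R2.fastq > sample_R1.tagged.fastq
```

The same call with the R3 file as the first argument tags the second read.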


Run the conversion/demultiplexing


Change to the output directory you just created:
cd /where/you/want/the/output


Run make:
make


If you have multiple processing cores (where N = number of cores):
make -j N
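
If you don't know N offhand, you can ask the system.  This sketch uses getconf _NPROCESSORS_ONLN, which works on the Linux systems I have tried but is not guaranteed everywhere, so it falls back to 1.  core_count is a hypothetical helper name.

```shell
# core_count
# Print the number of online processor cores, or 1 if it cannot be determined.
core_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;                  # not a plain number, fall back to 1
    esac
    printf '%s\n' "$n"
}

# Example call:
#   make -j "$(core_count)"
```

On a shared machine you may want to pass a smaller number so you leave cores for other users.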