sgsfak/1000genomes_reprod_align.md

## 1000genomes_reprod_align.md

      
    Raw
  

              1000genomes_reprod_align.md
            
          
    Reproducing (some of) the sequence alignment processes of 1000genomes project

Download the 100genomes samples

First I download the sequence.index file from the 1000genomes EBI FRP site:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index
This is a tab separated file with the following columns:
head -1 sequence.index | tr '\t' '\n' | nl
##      1	FASTQ_FILE
##      2	MD5
##      3	RUN_ID
##      4	STUDY_ID
##      5	STUDY_NAME
##      6	CENTER_NAME
##      7	SUBMISSION_ID
##      8	SUBMISSION_DATE
##      9	SAMPLE_ID
##     10	SAMPLE_NAME
##     11	POPULATION
##     12	EXPERIMENT_ID
##     13	INSTRUMENT_PLATFORM
##     14	INSTRUMENT_MODEL
##     15	LIBRARY_NAME
##     16	RUN_NAME
##     17	RUN_BLOCK_NAME
##     18	INSERT_SIZE
##     19	LIBRARY_LAYOUT
##     20	PAIRED_FASTQ
##     21	WITHDRAWN
##     22	WITHDRAWN_DATE
##     23	COMMENT
##     24	READ_COUNT
##     25	BASE_COUNT
##     26	ANALYSIS_GROUP

We are going to select only the samples that have been processed by ILLUMINA
"instrument platform" are of the "PAIRED" library layout, have not been
withdrawn, and are 'low coverage'. Additionally we select only the rows that
have both "pairs" -- they have a filename ending in "_1" or "_2" and they have
been processed by NCBI (USA, with the "SRR" prefix):
<sequence.index awk -F "\t" -v OFS="," \
   '$13=="ILLUMINA" && $19=="PAIRED" && $21=="0" && $26=="low coverage" && \
    $1 ~/_[12].filt.fastq.gz$/ &&
    $3 ~/^SRR/ {print $1,$2,$3,$9}' \
   | sort -k3,4 -t,  > illumina.paired.lowcov.csv
head illumina.paired.lowcov.csv | column -t -s,
## data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz  846a12b80f63d8d38896aa4aff173ca8  SRR003040  SRS000095
## data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz  3094a557fecaffe0f3fb4b494bbe798c  SRR003040  SRS000095
## data/NA18498/sequence_read/SRR003041_1.filt.fastq.gz  7b7ea75fd54813a7569303a196d0dc6d  SRR003041  SRS000095
## data/NA18498/sequence_read/SRR003041_2.filt.fastq.gz  4dc4a2d84b8392357a30567a97d109af  SRR003041  SRS000095
## data/NA18498/sequence_read/SRR003042_1.filt.fastq.gz  e475d8d5dfc6ddd98c7273ba4e8a818f  SRR003042  SRS000095
## data/NA18498/sequence_read/SRR003042_2.filt.fastq.gz  15d983fb9790c707c0702ca9517d458b  SRR003042  SRS000095
## data/NA18522/sequence_read/SRR003071_1.filt.fastq.gz  28c18fb753b557da24bc94475c45c937  SRR003071  SRS000109
## data/NA18522/sequence_read/SRR003071_2.filt.fastq.gz  204d840102b45dcded7ec2cee3287b4c  SRR003071  SRS000109
## data/NA18522/sequence_read/SRR003072_1.filt.fastq.gz  736d425aa486267c21e1236408dead14  SRR003072  SRS000109
## data/NA18522/sequence_read/SRR003072_2.filt.fastq.gz  b75c13ab99056598fedb0cd6e920f3c3  SRR003072  SRS000109

We keep only the file, MD5, "run id", and sample id columns in the illumina.paired.lowcov.csv output file
How many samples do we have?
cut -f4 -d, illumina.paired.lowcov.csv|sort | uniq -c | wc -l
## 1048

But there are multiple submissions of fastq files for the same sample!
So I am going to choose one of those "runs" per sample:
cut -f3,4 -d, illumina.paired.lowcov.csv \
    | tr ',' ' '\
    | sort -k 2 \
    | uniq -f 1 \
    | tr ' ' ',' > selected_runs.csv
head selected_runs.csv | column -t -s,  
## SRR400038  SRS000030
## SRR003667  SRS000031
## SRR385751  SRS000032
## SRR003539  SRS000035
## SRR006016  SRS000037
## SRR211280  SRS000039
## SRR006171  SRS000041
## SRR006092  SRS000042
## SRR006118  SRS000043
## SRR023299  SRS000044

Merge (join) the two files to filter and choose one the runids for each sample:
join -1 2 <(sed 's/,/ /2' illumina.paired.lowcov.csv) <(sort selected_runs.csv) \
   | tr ' ' , > sel.csv
head sel.csv | column -t -s, 
## SRR003040  SRS000095  data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz  846a12b80f63d8d38896aa4aff173ca8
## SRR003040  SRS000095  data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz  3094a557fecaffe0f3fb4b494bbe798c
## SRR003071  SRS000109  data/NA18522/sequence_read/SRR003071_1.filt.fastq.gz  28c18fb753b557da24bc94475c45c937
## SRR003071  SRS000109  data/NA18522/sequence_read/SRR003071_2.filt.fastq.gz  204d840102b45dcded7ec2cee3287b4c
## SRR003107  SRS000142  data/NA18856/sequence_read/SRR003107_1.filt.fastq.gz  3972ce3099fb30d4d265b3a07eb251fe
## SRR003107  SRS000142  data/NA18856/sequence_read/SRR003107_2.filt.fastq.gz  36145f73fb00720dd0ecb0eddedd2acd
## SRR003114  SRS000050  data/NA11920/sequence_read/SRR003114_1.filt.fastq.gz  6c06635fc52da0f1c71a4c53d91382ca
## SRR003114  SRS000050  data/NA11920/sequence_read/SRR003114_2.filt.fastq.gz  c47518c16c8dbd31014cb5141aeb89b5
## SRR003121  SRS000065  data/NA12155/sequence_read/SRR003121_1.filt.fastq.gz  79ddf87d07af42ddb2ab26444145658f
## SRR003121  SRS000065  data/NA12155/sequence_read/SRR003121_2.filt.fastq.gz  28b25157e29d1e89493b19d8d9468dd3

And finally construct a file with the "_1" and "_2" files for each selected sample and run id:
paste - - <sel.csv | tr '\t' ',' | cut -f5,6 --complement -d, > final_selected.csv
head final_selected.csv | column -t -s,
## SRR003040  SRS000095  data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz  846a12b80f63d8d38896aa4aff173ca8  data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz  3094a557fecaffe0f3fb4b494bbe798c
## SRR003071  SRS000109  data/NA18522/sequence_read/SRR003071_1.filt.fastq.gz  28c18fb753b557da24bc94475c45c937  data/NA18522/sequence_read/SRR003071_2.filt.fastq.gz  204d840102b45dcded7ec2cee3287b4c
## SRR003107  SRS000142  data/NA18856/sequence_read/SRR003107_1.filt.fastq.gz  3972ce3099fb30d4d265b3a07eb251fe  data/NA18856/sequence_read/SRR003107_2.filt.fastq.gz  36145f73fb00720dd0ecb0eddedd2acd
## SRR003114  SRS000050  data/NA11920/sequence_read/SRR003114_1.filt.fastq.gz  6c06635fc52da0f1c71a4c53d91382ca  data/NA11920/sequence_read/SRR003114_2.filt.fastq.gz  c47518c16c8dbd31014cb5141aeb89b5
## SRR003121  SRS000065  data/NA12155/sequence_read/SRR003121_1.filt.fastq.gz  79ddf87d07af42ddb2ab26444145658f  data/NA12155/sequence_read/SRR003121_2.filt.fastq.gz  28b25157e29d1e89493b19d8d9468dd3
## SRR003128  SRS000056  data/NA12003/sequence_read/SRR003128_1.filt.fastq.gz  1a6c708f72579890d18e515de9f9b1ed  data/NA12003/sequence_read/SRR003128_2.filt.fastq.gz  207be25d385d5657f60e2460fa5c8692
## SRR003142  SRS000052  data/NA11992/sequence_read/SRR003142_1.filt.fastq.gz  812eaa35e21eada5f53200c96da41d9f  data/NA11992/sequence_read/SRR003142_2.filt.fastq.gz  696445a5798c5dcad8e41e10bdbd355e
## SRR003231  SRS000181  data/NA19093/sequence_read/SRR003231_1.filt.fastq.gz  c8458d8d381a74da5de159abdcefc232  data/NA19093/sequence_read/SRR003231_2.filt.fastq.gz  58144d5b105020e18ab6630cbdec3cd1
## SRR003258  SRS000094  data/NA18489/sequence_read/SRR003258_1.filt.fastq.gz  cfe616bc5792362e42b8880d4abd5cbb  data/NA18489/sequence_read/SRR003258_2.filt.fastq.gz  cd712d7c388e8cf2665d32d681fbfb0d
## SRR003293  SRS000184  data/NA19102/sequence_read/SRR003293_1.filt.fastq.gz  6a20235666770460a52fd0ecea440aac  data/NA19102/sequence_read/SRR003293_2.filt.fastq.gz  01357e6f209dd373e8d28351d608e0de

Alignment process

NOTE:The process shown here is the one proposed by the 1000genomes and can be found at
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.alignment_data in section B
Prerequisites:

The Burrows-Wheeler Aligner
SAMtools

Firstly we need the human reference genome that can be downloaded from
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
and unzipped. Then I used the bwa tool to index it:
bwa index reference_genome/human_g1k_v37.fasta
The indexing process took Real time: 6729.003 sec; CPU: 6723.132 sec
Let's see the process for one sample. I am going to take the first line of the final_selected.csv
and download the relevant files. The commands are:
head -1 final_selected.csv \
   | awk -F, -v OFS="" 'BEGIN {server = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/"}
      {
       print "wget ", server, $3, " -O- | gunzip >", $1, "_1.fastq";
       print "wget ", server, $5, " -O- | gunzip >", $1, "_2.fastq";
      }'
## wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz -O- | gunzip >SRR003040_1.fastq
## wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz -O- | gunzip >SRR003040_2.fastq

(Normally we should check the MD5 checksum but anyway...)
Then each of the files need to be aligned with the ref. genome as follows:
bwa aln -t 8 -q 15 -f SRR003040_1.sai reference_genome/human_g1k_v37.fasta SRR003040_1.fastq
bwa aln -t 8 -q 15 -f SRR003040_2.sai reference_genome/human_g1k_v37.fasta SRR003040_2.fastq
(The -t <num> option provides the number of threads to use, see the
manual)
In a typical run the (each of the) above command took Real time: 543.554 sec;
CPU: 3659.933 sec
Then we build a SAM file combining the "_1" and "_2" aligned pairs:
bwa sampe -f SRR003040.sam reference_genome/human_g1k_v37.fasta SRR003040_*.sai SRR003040_*.fastq
(Note: what about the "max insert size" (option -a <num>)? this "should be
taken to be 3 times the expected insert size" but what's the expected size??
In any case if not given bwa uses the default 500 and "this option is only
used when there are not enough good alignment to infer the distribution of
insert sizes.")
When I run it bwa reported "Real time: 2764.259 sec; CPU: 2762.281 sec"
The final step is to construct the BAM file from SAM, sort, fix mate information and add the MD tag. This
is done using SAMtools as follows:
samtools view -bSu SRR003040.sam \
	| samtools sort -n -o - samtools_nsort_tmp  \
	| samtools fixmate /dev/stdin /dev/stdout  \
	| samtools sort -o - samtools_csort_tmp  \
	|samtools fillmd -u - reference_genome/human_g1k_v37.fasta > SRR003040.bam
These are the files produced for a single sample:
ls -lh SRR003040*
## -rw-rw-r-- 1 ssfak ssfak 1.3G Jul  3 13:26 SRR003040_1.fastq
## -rw-rw-r-- 1 ssfak ssfak 374M Jul  3 14:39 SRR003040_1.sai
## -rw-rw-r-- 1 ssfak ssfak 1.3G Jul  3 13:27 SRR003040_2.fastq
## -rw-rw-r-- 1 ssfak ssfak 297M Jul  3 14:51 SRR003040_2.sai
## -rw-rw-r-- 1 ssfak ssfak 2.9G Jul  3 16:06 SRR003040.bam
## -rw-rw-r-- 1 ssfak ssfak 3.5G Jul  3 15:43 SRR003040.sam

More information and links


bowtie2 is an
alternative alignment tool that seems to be faster than bwa


A report with benchmarking results (including a 1000genomes "workload") is at
http://www.exascience.com/wp-content/uploads/2013/12/Herzeel-BWAReport.pdf It
has some interesting information on the parallelization of bwa.