First I download the sequence.index
file from the 1000genomes EBI FRP site:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index
This is a tab separated file with the following columns:
head -1 sequence.index | tr '\t' '\n' | nl
## 1 FASTQ_FILE
## 2 MD5
## 3 RUN_ID
## 4 STUDY_ID
## 5 STUDY_NAME
## 6 CENTER_NAME
## 7 SUBMISSION_ID
## 8 SUBMISSION_DATE
## 9 SAMPLE_ID
## 10 SAMPLE_NAME
## 11 POPULATION
## 12 EXPERIMENT_ID
## 13 INSTRUMENT_PLATFORM
## 14 INSTRUMENT_MODEL
## 15 LIBRARY_NAME
## 16 RUN_NAME
## 17 RUN_BLOCK_NAME
## 18 INSERT_SIZE
## 19 LIBRARY_LAYOUT
## 20 PAIRED_FASTQ
## 21 WITHDRAWN
## 22 WITHDRAWN_DATE
## 23 COMMENT
## 24 READ_COUNT
## 25 BASE_COUNT
## 26 ANALYSIS_GROUP
We are going to select only the samples that have been processed by ILLUMINA "instrument platform" are of the "PAIRED" library layout, have not been withdrawn, and are 'low coverage'. Additionally we select only the rows that have both "pairs" -- they have a filename ending in "_1" or "_2" and they have been processed by NCBI (USA, with the "SRR" prefix):
<sequence.index awk -F "\t" -v OFS="," \
'$13=="ILLUMINA" && $19=="PAIRED" && $21=="0" && $26=="low coverage" && \
$1 ~/_[12].filt.fastq.gz$/ &&
$3 ~/^SRR/ {print $1,$2,$3,$9}' \
| sort -k3,4 -t, > illumina.paired.lowcov.csv
head illumina.paired.lowcov.csv | column -t -s,
## data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz 846a12b80f63d8d38896aa4aff173ca8 SRR003040 SRS000095
## data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz 3094a557fecaffe0f3fb4b494bbe798c SRR003040 SRS000095
## data/NA18498/sequence_read/SRR003041_1.filt.fastq.gz 7b7ea75fd54813a7569303a196d0dc6d SRR003041 SRS000095
## data/NA18498/sequence_read/SRR003041_2.filt.fastq.gz 4dc4a2d84b8392357a30567a97d109af SRR003041 SRS000095
## data/NA18498/sequence_read/SRR003042_1.filt.fastq.gz e475d8d5dfc6ddd98c7273ba4e8a818f SRR003042 SRS000095
## data/NA18498/sequence_read/SRR003042_2.filt.fastq.gz 15d983fb9790c707c0702ca9517d458b SRR003042 SRS000095
## data/NA18522/sequence_read/SRR003071_1.filt.fastq.gz 28c18fb753b557da24bc94475c45c937 SRR003071 SRS000109
## data/NA18522/sequence_read/SRR003071_2.filt.fastq.gz 204d840102b45dcded7ec2cee3287b4c SRR003071 SRS000109
## data/NA18522/sequence_read/SRR003072_1.filt.fastq.gz 736d425aa486267c21e1236408dead14 SRR003072 SRS000109
## data/NA18522/sequence_read/SRR003072_2.filt.fastq.gz b75c13ab99056598fedb0cd6e920f3c3 SRR003072 SRS000109
We keep only the file, MD5, "run id", and sample id columns in the illumina.paired.lowcov.csv
output file
How many samples do we have?
cut -f4 -d, illumina.paired.lowcov.csv|sort | uniq -c | wc -l
## 1048
But there are multiple submissions of fastq files for the same sample! So I am going to choose one of those "runs" per sample:
cut -f3,4 -d, illumina.paired.lowcov.csv \
| tr ',' ' '\
| sort -k 2 \
| uniq -f 1 \
| tr ' ' ',' > selected_runs.csv
head selected_runs.csv | column -t -s,
## SRR400038 SRS000030
## SRR003667 SRS000031
## SRR385751 SRS000032
## SRR003539 SRS000035
## SRR006016 SRS000037
## SRR211280 SRS000039
## SRR006171 SRS000041
## SRR006092 SRS000042
## SRR006118 SRS000043
## SRR023299 SRS000044
Merge (join) the two files to filter and choose one the runids for each sample:
join -1 2 <(sed 's/,/ /2' illumina.paired.lowcov.csv) <(sort selected_runs.csv) \
| tr ' ' , > sel.csv
head sel.csv | column -t -s,
## SRR003040 SRS000095 data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz 846a12b80f63d8d38896aa4aff173ca8
## SRR003040 SRS000095 data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz 3094a557fecaffe0f3fb4b494bbe798c
## SRR003071 SRS000109 data/NA18522/sequence_read/SRR003071_1.filt.fastq.gz 28c18fb753b557da24bc94475c45c937
## SRR003071 SRS000109 data/NA18522/sequence_read/SRR003071_2.filt.fastq.gz 204d840102b45dcded7ec2cee3287b4c
## SRR003107 SRS000142 data/NA18856/sequence_read/SRR003107_1.filt.fastq.gz 3972ce3099fb30d4d265b3a07eb251fe
## SRR003107 SRS000142 data/NA18856/sequence_read/SRR003107_2.filt.fastq.gz 36145f73fb00720dd0ecb0eddedd2acd
## SRR003114 SRS000050 data/NA11920/sequence_read/SRR003114_1.filt.fastq.gz 6c06635fc52da0f1c71a4c53d91382ca
## SRR003114 SRS000050 data/NA11920/sequence_read/SRR003114_2.filt.fastq.gz c47518c16c8dbd31014cb5141aeb89b5
## SRR003121 SRS000065 data/NA12155/sequence_read/SRR003121_1.filt.fastq.gz 79ddf87d07af42ddb2ab26444145658f
## SRR003121 SRS000065 data/NA12155/sequence_read/SRR003121_2.filt.fastq.gz 28b25157e29d1e89493b19d8d9468dd3
And finally construct a file with the "_1" and "_2" files for each selected sample and run id:
paste - - <sel.csv | tr '\t' ',' | cut -f5,6 --complement -d, > final_selected.csv
head final_selected.csv | column -t -s,
## SRR003040 SRS000095 data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz 846a12b80f63d8d38896aa4aff173ca8 data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz 3094a557fecaffe0f3fb4b494bbe798c
## SRR003071 SRS000109 data/NA18522/sequence_read/SRR003071_1.filt.fastq.gz 28c18fb753b557da24bc94475c45c937 data/NA18522/sequence_read/SRR003071_2.filt.fastq.gz 204d840102b45dcded7ec2cee3287b4c
## SRR003107 SRS000142 data/NA18856/sequence_read/SRR003107_1.filt.fastq.gz 3972ce3099fb30d4d265b3a07eb251fe data/NA18856/sequence_read/SRR003107_2.filt.fastq.gz 36145f73fb00720dd0ecb0eddedd2acd
## SRR003114 SRS000050 data/NA11920/sequence_read/SRR003114_1.filt.fastq.gz 6c06635fc52da0f1c71a4c53d91382ca data/NA11920/sequence_read/SRR003114_2.filt.fastq.gz c47518c16c8dbd31014cb5141aeb89b5
## SRR003121 SRS000065 data/NA12155/sequence_read/SRR003121_1.filt.fastq.gz 79ddf87d07af42ddb2ab26444145658f data/NA12155/sequence_read/SRR003121_2.filt.fastq.gz 28b25157e29d1e89493b19d8d9468dd3
## SRR003128 SRS000056 data/NA12003/sequence_read/SRR003128_1.filt.fastq.gz 1a6c708f72579890d18e515de9f9b1ed data/NA12003/sequence_read/SRR003128_2.filt.fastq.gz 207be25d385d5657f60e2460fa5c8692
## SRR003142 SRS000052 data/NA11992/sequence_read/SRR003142_1.filt.fastq.gz 812eaa35e21eada5f53200c96da41d9f data/NA11992/sequence_read/SRR003142_2.filt.fastq.gz 696445a5798c5dcad8e41e10bdbd355e
## SRR003231 SRS000181 data/NA19093/sequence_read/SRR003231_1.filt.fastq.gz c8458d8d381a74da5de159abdcefc232 data/NA19093/sequence_read/SRR003231_2.filt.fastq.gz 58144d5b105020e18ab6630cbdec3cd1
## SRR003258 SRS000094 data/NA18489/sequence_read/SRR003258_1.filt.fastq.gz cfe616bc5792362e42b8880d4abd5cbb data/NA18489/sequence_read/SRR003258_2.filt.fastq.gz cd712d7c388e8cf2665d32d681fbfb0d
## SRR003293 SRS000184 data/NA19102/sequence_read/SRR003293_1.filt.fastq.gz 6a20235666770460a52fd0ecea440aac data/NA19102/sequence_read/SRR003293_2.filt.fastq.gz 01357e6f209dd373e8d28351d608e0de
NOTE:The process shown here is the one proposed by the 1000genomes and can be found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.alignment_data in section B
Prerequisites:
Firstly we need the human reference genome that can be downloaded from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz and unzipped. Then I used the bwa tool to index it:
bwa index reference_genome/human_g1k_v37.fasta
The indexing process took Real time: 6729.003 sec; CPU: 6723.132 sec
Let's see the process for one sample. I am going to take the first line of the final_selected.csv
and download the relevant files. The commands are:
head -1 final_selected.csv \
| awk -F, -v OFS="" 'BEGIN {server = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/"}
{
print "wget ", server, $3, " -O- | gunzip >", $1, "_1.fastq";
print "wget ", server, $5, " -O- | gunzip >", $1, "_2.fastq";
}'
## wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18498/sequence_read/SRR003040_1.filt.fastq.gz -O- | gunzip >SRR003040_1.fastq
## wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA18498/sequence_read/SRR003040_2.filt.fastq.gz -O- | gunzip >SRR003040_2.fastq
(Normally we should check the MD5 checksum but anyway...)
Then each of the files need to be aligned with the ref. genome as follows:
bwa aln -t 8 -q 15 -f SRR003040_1.sai reference_genome/human_g1k_v37.fasta SRR003040_1.fastq
bwa aln -t 8 -q 15 -f SRR003040_2.sai reference_genome/human_g1k_v37.fasta SRR003040_2.fastq
(The -t <num>
option provides the number of threads to use, see the
manual)
In a typical run the (each of the) above command took Real time: 543.554 sec; CPU: 3659.933 sec
Then we build a SAM file combining the "_1" and "_2" aligned pairs:
bwa sampe -f SRR003040.sam reference_genome/human_g1k_v37.fasta SRR003040_*.sai SRR003040_*.fastq
(Note: what about the "max insert size" (option -a <num>
)? this "should be
taken to be 3 times the expected insert size" but what's the expected size??
In any case if not given bwa
uses the default 500 and "this option is only
used when there are not enough good alignment to infer the distribution of
insert sizes.")
When I run it bwa
reported "Real time: 2764.259 sec; CPU: 2762.281 sec"
The final step is to construct the BAM file from SAM, sort, fix mate information and add the MD tag. This is done using SAMtools as follows:
samtools view -bSu SRR003040.sam \
| samtools sort -n -o - samtools_nsort_tmp \
| samtools fixmate /dev/stdin /dev/stdout \
| samtools sort -o - samtools_csort_tmp \
|samtools fillmd -u - reference_genome/human_g1k_v37.fasta > SRR003040.bam
These are the files produced for a single sample:
ls -lh SRR003040*
## -rw-rw-r-- 1 ssfak ssfak 1.3G Jul 3 13:26 SRR003040_1.fastq
## -rw-rw-r-- 1 ssfak ssfak 374M Jul 3 14:39 SRR003040_1.sai
## -rw-rw-r-- 1 ssfak ssfak 1.3G Jul 3 13:27 SRR003040_2.fastq
## -rw-rw-r-- 1 ssfak ssfak 297M Jul 3 14:51 SRR003040_2.sai
## -rw-rw-r-- 1 ssfak ssfak 2.9G Jul 3 16:06 SRR003040.bam
## -rw-rw-r-- 1 ssfak ssfak 3.5G Jul 3 15:43 SRR003040.sam
-
bowtie2 is an alternative alignment tool that seems to be faster than
bwa
-
A report with benchmarking results (including a 1000genomes "workload") is at http://www.exascience.com/wp-content/uploads/2013/12/Herzeel-BWAReport.pdf It has some interesting information on the parallelization of
bwa
.