Let's say you have two bam files from two different runs you need to merge.
- bam1_batch1.bam
- bam1_batch2.bam
Using samtools view -H <bamfile>
, get @RG information for each of two bams. I usually get ID, DT and PL info. from bam header and then add SM (sample tag) tag using following script. You can run this script before samtools merge
command in your PBS script.
#!/bin/bash
cat > rg.txt << EOF
@RG ID:HWI-ST208:532:C72EYACXX DT:2015-08-21 PL:ILLUMINA PU:C72EYACXX SM:$1
@RG ID:HWI-ST1120:363:C72GNACXX DT:2015-09-29 PL:ILLUMINA PU:C72GNACXX SM:$2
EOF
Run this script:
make_rg.sh bam1_batch1.bam bam1_batch2.bam
This will create file
rg.txt
in a working direcotry, unique for every pair of bams you want to merge.
To merge bam files, run following:
samtools merge -h rg.txt tmp_<merged_bam_name>.bam bam1_batch1.bam bam1_batch2.bam
Following command from Picard tools will add comments @CO tag with steps you did to add @RG above. This steps essentially rewrites entire bam to edit a header. So, larger the bam size, more time it takes to rewrite header info.
java -jar /apps/picard/default/AddCommentsToBam.jar INPUT=tmp_<merged_bam_name>.bam OUTPUT=<merged_bam_name>.bam COMMENT=\"ID:samtools VN:1.2.1 CL: "$1" COMMENT="Picard AddCommentsToBam.jar command was used to add this @CO line and make bam index." COMMENT="Batch 1 bam is from "$2" run; Batch2 bam is from "$3" run" CREATE_INDEX=true CREATE_MD5_FILE=true VALIDATION_STRINGENCY=LENIENT TMP_DIR=/dev/shm
View updated header:
samtools view -H <merged_bam_name>.bam
@RG ID:HWI-ST208:532:C72EYACXX PL:ILLUMINA PU:C72EYACXX DT:2015-08-20T19:00:00-0500 SM:bam1_batch1.bam
@RG ID:HWI-ST1120:363:C72GNACXX PL:ILLUMINA PU:C72GNACXX DT:2015-09-28T19:00:00-0500 SM:bam1_batch2.bam
...
@CO ID:samtools VN:1.2.1 CL: make_rg.sh bam1_batch1.bam bam1_batch2.bam && sleep 5 && samtools merge -h rg.txt tmp_<merged_bam_name>.bam bam1_batch1.bam bam1_batch2.bam
@CO Picard AddCommentsToBam.jar command was used to add this @CO line and make bam index.
@CO Batch 1 bam is from 150821_SN208_0532_AC72EYACXX_27Aug15 run; Batch2 bam is from 150929_SN1120_0363_AC72GNACXX run.
END