@iansealy
Created April 1, 2016 09:14
Merge 4 DETCT lanes to 1

Set shell variables naming the four lanes, each in the form run_lane (run ID, underscore, lane number). For example:

run_lane1=19231_1; run_lane2=19231_2; run_lane3=19244_1; run_lane4=19244_2

Extract all the samples and tags:

run1=`echo $run_lane1 | sed -e 's/_.*//'`
lane1=`echo $run_lane1 | sed -e 's/.*_//'`
run2=`echo $run_lane2 | sed -e 's/_.*//'`
lane2=`echo $run_lane2 | sed -e 's/.*_//'`
run3=`echo $run_lane3 | sed -e 's/_.*//'`
lane3=`echo $run_lane3 | sed -e 's/.*_//'`
run4=`echo $run_lane4 | sed -e 's/_.*//'`
lane4=`echo $run_lane4 | sed -e 's/.*_//'`
mysql -h seqw-db -P 3379 -u warehouse_ro sequencescape_warehouse -Bse \
"SELECT DISTINCT supplier_name, tag_index, tag_sequence FROM npg_plex_information, current_samples
WHERE npg_plex_information.sample_id = current_samples.internal_id
AND ((id_run = $run1 AND position = $lane1) OR (id_run = $run2 AND position = $lane2) OR (id_run = $run3 AND position = $lane3) OR (id_run = $run4 AND position = $lane4))
AND tag_index <> 0 AND tag_index <> 168 AND tag_index <> 888
ORDER BY tag_index" \
| awk '{ print $3, $1 }' | sort -u \
| sed -e 's/_[a-zA-Z]*[0-9]*$//' \
| awk '{ print $2 ":" $1 }' | sort -u > $run_lane1/samples.txt
cp $run_lane1/samples.txt $run_lane2/samples.txt
cp $run_lane1/samples.txt $run_lane3/samples.txt
cp $run_lane1/samples.txt $run_lane4/samples.txt
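
The text-processing half of the command above can be tried without database access. The following is a minimal sketch, assuming the query prints tab-separated supplier_name, tag_index and tag_sequence columns (the layout mysql -B produces); the function name to_pairs and the sample rows are hypothetical, chosen only for illustration.

```shell
# Sketch: turn "supplier_name<TAB>tag_index<TAB>tag_sequence" rows
# into "name:tag" pairs, using the same awk/sed/sort steps as above.
to_pairs() {
    awk '{ print $3, $1 }' | sort -u \
    | sed -e 's/_[a-zA-Z]*[0-9]*$//' \
    | awk '{ print $2 ":" $1 }' | sort -u
}

# Hypothetical rows, for illustration only
printf 'zmp_ph1_a1\t1\tCACTGT\nzmp_ph1_b2\t2\tGGTTAC\n' | to_pairs
```

With these made-up rows the sed step strips the trailing _a1 and _b2 suffixes, so both tags end up under the single name zmp_ph1 (zmp_ph1:CACTGT and zmp_ph1:GGTTAC).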

View $run_lane1/samples.txt and check that the first column (the part before the colon) contains only one name.
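
This check can also be scripted rather than done by eye. A minimal sketch, assuming samples.txt holds name:tag lines as produced above; the function names count_names and check_single_name are hypothetical.

```shell
# Count the distinct names in the first (pre-colon) column of a samples file.
count_names() {
    sed -e 's/:.*//' "$1" | sort -u | wc -l | tr -d ' '
}

# Succeed only if exactly one distinct name is present.
check_single_name() {
    [ "$(count_names "$1")" -eq 1 ]
}

# Possible usage:
#   check_single_name "$run_lane1/samples.txt" || echo "More than one name" >&2
```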

Merge:

dest1=`sed -e 's/:.*//' $run_lane1/samples.txt | sort -u | head -1`
sed -e 's/:.*//' $run_lane1/samples.txt | sort -u | xargs mkdir -p
for pair in `cat $run_lane1/samples.txt`
do
dest=`echo $pair | sed -e 's/:.*//'`
tag=`echo $pair | sed -e 's/.*://'`
filename=`ls $run_lane1/*NNNN${tag}*.bam $run_lane2/*NNNN${tag}*.bam $run_lane3/*NNNN${tag}*.bam $run_lane4/*NNNN${tag}*.bam  | grep -v tr.bam$ | head -1 | sed -e 's/.*\///'`
bsub -o $dest/merge.$tag.o -e $dest/merge.$tag.e \
-R'select[mem>4000] rusage[mem=4000]' -M4000 \
java -XX:ParallelGCThreads=1 -Xmx4g -jar /software/team31/packages/picard-tools/MergeSamFiles.jar \
INPUT=`ls $run_lane1/*NNNN${tag}*.bam $run_lane2/*NNNN${tag}*.bam $run_lane3/*NNNN${tag}*.bam $run_lane4/*NNNN${tag}*.bam  | grep -v tr.bam$ | sort | tr '\n' ' ' | sed -e 's/ $//' | sed -e 's/ / INPUT=/g'` \
OUTPUT=$dest/$filename \
MSD=true ASSUME_SORTED=false \
VALIDATION_STRINGENCY=SILENT VERBOSITY=WARNING QUIET=true \
TMP_DIR=$dest
filename=`ls $run_lane1/*NNNN${tag}*.bam $run_lane2/*NNNN${tag}*.bam $run_lane3/*NNNN${tag}*.bam $run_lane4/*NNNN${tag}*.bam | grep tr.bam$ | head -1 | sed -e 's/.*\///'`
bsub -o $dest/tr.merge.$tag.o -e $dest/tr.merge.$tag.e \
-R'select[mem>4000] rusage[mem=4000]' -M4000 \
java -XX:ParallelGCThreads=1 -Xmx4g -jar /software/team31/packages/picard-tools/MergeSamFiles.jar \
INPUT=`ls $run_lane1/*NNNN${tag}*.bam $run_lane2/*NNNN${tag}*.bam $run_lane3/*NNNN${tag}*.bam $run_lane4/*NNNN${tag}*.bam | grep tr.bam$ | sort | tr '\n' ' ' | sed -e 's/ $//' | sed -e 's/ / INPUT=/g'` \
OUTPUT=$dest/$filename \
MSD=true ASSUME_SORTED=false \
VALIDATION_STRINGENCY=SILENT VERBOSITY=WARNING QUIET=true \
TMP_DIR=$dest
done
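
The INPUT=... argument list in the loop is built by a chain of text filters, which can be exercised on its own. This is a minimal sketch of that step for the non-tr case; the function name build_inputs and the BAM file names are hypothetical, and the grep pattern is quoted and escaped here, which is an assumption about intent rather than the original spelling.

```shell
# Read BAM paths on stdin, drop the *tr.bam files, and print the rest
# as a single "INPUT=a INPUT=b ..." argument string for MergeSamFiles.
build_inputs() {
    args=$(grep -v 'tr\.bam$' | sort | tr '\n' ' ' \
        | sed -e 's/ $//' | sed -e 's/ / INPUT=/g')
    echo "INPUT=$args"
}

# Hypothetical file names, for illustration only
printf '%s\n' 19231_2/s_NNNNCACTGT.bam 19231_1/s_NNNNCACTGT.bam \
    19231_1/s_NNNNCACTGT.tr.bam | build_inputs
```

Note that the approach relies on paths containing no spaces, since spaces are what get rewritten into the INPUT= separators.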

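Check the jobs ran OK. The three numbers printed are the expected number of jobs (two per tag), the number of jobs that completed successfully, and the number that exited with an error; the first two should match and the third should be 0: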

echo `wc -l $run_lane1/samples.txt | awk '{ print $1 }'` \* 2 | bc && \
grep -l 'Successfully completed' $dest1/*merge.*.o | sort -u | wc -l && \
grep -l 'Exited' $dest1/*merge.*.o | sort -u | wc -l
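
The comparison of those counts can be automated. A minimal sketch, assuming two jobs per line of samples.txt and LSF output files matching *merge.*.o as above; the function name check_merge_logs is hypothetical.

```shell
# Compare the expected number of merge jobs with the number of LSF
# output files reporting success, and fail if any job exited.
check_merge_logs() {
    samples_file=$1
    log_dir=$2
    expected=$(( $(wc -l < "$samples_file") * 2 ))
    completed=$(grep -l 'Successfully completed' "$log_dir"/*merge.*.o \
        2>/dev/null | sort -u | wc -l | tr -d ' ')
    failed=$(grep -l 'Exited' "$log_dir"/*merge.*.o \
        2>/dev/null | sort -u | wc -l | tr -d ' ')
    echo "expected=$expected completed=$completed failed=$failed"
    [ "$completed" -eq "$expected" ] && [ "$failed" -eq 0 ]
}

# Possible usage:
#   check_merge_logs "$run_lane1/samples.txt" "$dest1" || echo "Check logs" >&2
```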

Set a new shell variable for the merged experiment:

run_lane=`sed -e 's/:.*//' $run_lane1/samples.txt | sort -u | head -1`

Copy useful files from original directories:

cp $run_lane1/*.imeta $run_lane2/*.imeta $run_lane3/*.imeta $run_lane4/*.imeta $run_lane1/*.stats $run_lane2/*.stats $run_lane3/*.stats $run_lane4/*.stats $run_lane1/samples.txt $dest1

Delete original directories:

rm -rf $run_lane1 $run_lane2 $run_lane3 $run_lane4

Then continue from the "Index merged bam files" section.
