@srividya22
Forked from jdblischak/README.md
Created May 29, 2018 22:08
snakemake_vmem_usage

Testing Snakemake virtual memory usage

John Blischak 2014-05-14

Multiple users have observed that submitting jobs via Snakemake requires much more memory than is necessary to run the command (e.g. mailing list post, Bitbucket issue).

To document and explore this issue, I have created a few scripts to recreate the problem. The task to perform is simply sleeping for a short, random period of time and then creating a file. For comparison, I run this 100 times using three methods:

  • shell - submit the job to qsub directly from the Bash shell
  • subprocess - submit the job to qsub via the Python subprocess module
  • snakemake - submit the job via Snakemake

Example usage:

The analysis can be run using the commands below. The argument passed to the scripts is an ID number for that particular run. The first script should be run on the head node from within this directory.

bash submit_all.sh 01
bash check_and_clean.sh 01
bash analyze.sh 01

Results

The virtual memory usage is similar (and minimal, about 120M) when using the shell or subprocess. However, the virtual memory usage increases by an order of magnitude when using Snakemake, to 1.44G or more. Furthermore, the virtual memory usage for Snakemake jobs fluctuates by over 1G (between about 1.44G and 2.56G), which is more than the total virtual memory requirement for the other two methods.

The commands run through shell had the following distribution of virtual memory usage:

  • 14 119.848M
  • 83 119.910M
  • 3 119.926M

The commands run through subprocess had the following distribution of virtual memory usage:

  • 9 119.848M
  • 88 119.910M
  • 3 119.926M

The commands run through snakemake had the following distribution of virtual memory usage:

  • 82 1.440G
  • 1 1.441G
  • 2 2.558G
  • 15 2.559G
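As a rough check of the "order of magnitude" claim, the distributions above can be summarized in a few lines of Python. This is only a sketch: the counts and values are copied from the lists above, while the helper names (`to_mib`, `weighted_mean`, `spread`) are made up for illustration, and it assumes that the `M` and `G` suffixes reported by qacct are 1024-based multipliers.

```python
# Sketch: summarize the maxvmem distributions reported above.
# Assumption: the "M" and "G" suffixes are 1024-based multipliers.
UNITS_IN_MIB = {"M": 1.0, "G": 1024.0}


def to_mib(value):
    """Convert a string such as '119.848M' or '1.440G' to MiB."""
    return float(value[:-1]) * UNITS_IN_MIB[value[-1]]


def weighted_mean(distribution):
    """Mean maxvmem in MiB for a list of (job count, value) pairs."""
    total_jobs = sum(count for count, _ in distribution)
    total_mib = sum(count * to_mib(value) for count, value in distribution)
    return total_mib / total_jobs


def spread(distribution):
    """Difference in MiB between the largest and smallest reported value."""
    values = [to_mib(value) for _, value in distribution]
    return max(values) - min(values)


# (job count, maxvmem) pairs copied from the results above.
results = {
    "shell": [(14, "119.848M"), (83, "119.910M"), (3, "119.926M")],
    "subprocess": [(9, "119.848M"), (88, "119.910M"), (3, "119.926M")],
    "snakemake": [(82, "1.440G"), (1, "1.441G"), (2, "2.558G"), (15, "2.559G")],
}

means = {name: weighted_mean(dist) for name, dist in results.items()}
ratio = means["snakemake"] / means["shell"]
snakemake_spread = spread(results["snakemake"])

for name, mean in means.items():
    print("%s: mean maxvmem %.1f MiB" % (name, mean))
print("snakemake / shell ratio: %.1f" % ratio)
print("snakemake spread: %.1f MiB" % snakemake_spread)
```

Under these assumptions the ratio comes out above 10 and the Snakemake spread above 1024 MiB, consistent with the summary given in the results.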

I used the following software versions:

  • Red Hat Enterprise Linux Server release 5.4
  • Sun Grid Engine 6.2u3
  • Python 3.3.4
  • Snakemake 2.5.1

File descriptions

Files:

  • analyze.sh - retrieves the virtual memory usage via qacct
  • check_and_clean.sh - reports whether the jobs finished and removes all created files
  • jobscript.sh - job script template passed to Snakemake via --js
  • Snakefile - creates the files via Snakemake
  • submit.py - creates the files via subprocess
  • submit.sh - creates the files via shell
  • submit_all.sh - runs all three methods (run from the head node)
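The tallying that analyze.sh performs with grep, awk, sort and uniq -c can also be sketched in Python. This is an illustration, not part of the original scripts: the function name `tally_field` is hypothetical, and the sample text is a made-up excerpt shaped like the "field value" lines that qacct prints. The match is slightly stricter than grep, since it compares the whole first column rather than searching for a substring.

```python
from collections import Counter


def tally_field(qacct_text, field):
    """Count how often each value of `field` occurs in qacct-style output."""
    counts = Counter()
    for line in qacct_text.splitlines():
        parts = line.split()
        # Roughly `grep <field> | awk '{print $2}'`: keep the second column
        # of every line whose first column is the field name.
        if len(parts) >= 2 and parts[0] == field:
            counts[parts[1]] += 1
    return counts


# Hypothetical excerpt; real qacct output has many more fields per job.
sample = """\
jobname      job.shell.01.1
ru_maxrss    0
maxvmem      119.910M
jobname      job.shell.01.2
ru_maxrss    0
maxvmem      119.848M
jobname      job.shell.01.3
ru_maxrss    0
maxvmem      119.910M
"""

vmem_counts = tally_field(sample, "maxvmem")

# Print "count value" pairs, sorted by value like `sort | uniq -c`.
for value, count in sorted(vmem_counts.items()):
    print(count, value)
```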
#!/bin/bash
# Summarizes the virtual memory use by querying qacct.
ID=$1
for name in shell subprocess snakemake
do
  # Quote the pattern so that qacct, not the shell, matches the job names.
  qacct -j "job.$name.$ID.*" > qacct.$name.$ID.tmp.txt
  echo "The commands run through $name had the following distribution of virtual memory usage:"
  grep maxvmem qacct.$name.$ID.tmp.txt | awk '{print $2}' | sort | uniq -c
  echo "The commands run through $name had the following distribution of resident set size:"
  grep ru_maxrss qacct.$name.$ID.tmp.txt | awk '{print $2}' | sort | uniq -c
  rm qacct.$name.$ID.tmp.txt
done
#!/bin/bash
# Reports whether the jobs finished or not.
# Removes all created files.
ID=$1

# Method "shell": count the created files and the qsub output files.
shell_success=$(ls shell.$ID.[0-9]* 2> /dev/null | wc -l)
shell_total=$(ls job.shell.$ID.[0-9]* 2> /dev/null | wc -l)
echo "$shell_success of the $shell_total shell jobs completed."
if [ "$shell_success" -gt 0 ]
then
  rm shell.$ID.[0-9]*
fi
if [ "$shell_total" -gt 0 ]
then
  rm job.shell.$ID.[0-9]*
fi

# Method "subprocess"
subprocess_success=$(ls subprocess.$ID.[0-9]* 2> /dev/null | wc -l)
subprocess_total=$(ls job.subprocess.$ID.[0-9]* 2> /dev/null | wc -l)
echo "$subprocess_success of the $subprocess_total subprocess jobs completed."
if [ "$subprocess_success" -gt 0 ]
then
  rm subprocess.$ID.[0-9]*
fi
if [ "$subprocess_total" -gt 0 ]
then
  rm job.subprocess.$ID.[0-9]*
fi

# Method "snakemake"
snake_success=$(ls snakemake.$ID.[0-9]* 2> /dev/null | wc -l)
snake_total=$(ls job.snakemake.$ID.[0-9]* 2> /dev/null | wc -l)
echo "$snake_success of the $snake_total Snakemake jobs completed."
if [ "$snake_success" -gt 0 ]
then
  rm snakemake.$ID.[0-9]*
fi
if [ "$snake_total" -gt 0 ]
then
  rm job.snakemake.$ID.[0-9]*
fi

# Remove the Snakemake config file written by submit_all.sh, if present.
if [ -e snake.ID ]
then
  rm snake.ID
fi
#!/bin/sh
# properties = {properties}
ulimit -v 200000
{exec_job}
exit 0
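'''
Submits 100 jobs via Snakemake.
Referred to as method "snakemake".
'''

# Not used by the rules below; they read the run ID from the config file.
ID = '01'

# JSON file written by submit_all.sh; provides config['ID'].
configfile: 'snake.ID'

# Run the aggregating rule locally instead of submitting it to the cluster.
localrules: all

rule all:
    input: ['snakemake.%s.%d' % (config['ID'], i) for i in range(1, 101)]

# Sleep for a random multiple of 10 seconds, then create the output file.
rule touch:
    output: 'snakemake.%s.{i}' % (config['ID'])
    shell: 'sleep $((${{RANDOM:0:1}} * 10))s; touch {output}'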
'''
Submits 100 jobs using the Python subprocess module.
Referred to as method "subprocess".
'''
import sys
import os
import subprocess as sp

ID = sys.argv[1]

for i in range(1, 101):
    # Pipe the command to qsub, mirroring the "shell" method.
    cmd = ('echo "sleep $((${RANDOM:0:1} * 10))s; '
           'touch subprocess.%s.%d" | qsub -l h_vmem=4g '
           '-V -N job.subprocess.%s.%d -cwd -j y' % (ID, i, ID, i))
    sp.Popen(cmd, shell=True, executable=os.environ['SHELL'])
#!/bin/bash
# Submits 100 jobs by echoing the command directly to qsub.
# Referred to as method "shell".
ID=$1
for i in {1..100}
do
  echo "sleep $((${RANDOM:0:1} * 10))s; touch shell.$ID.$i" | qsub -l h_vmem=4g -V -N job.shell.$ID.$i -cwd -j y
done
#!/bin/bash
# Wrapper script for submitting jobs via the three methods.
# Should be run on the head node in the same working directory
# as the other scripts.
ID=$1
# Run jobs with shell
bash submit.sh $ID
# Run jobs with subprocess
python submit.py $ID
# Run jobs with Snakemake
echo -e "{\n\"ID\" : \"$ID\"\n}" > snake.ID
snakemake --js jobscript.sh -c "qsub -l h_vmem=4g -V -cwd -N job.{output} -j y" -j 100