@SamStudio8

SamStudio8/awk

Last active Aug 29, 2015
Benchmarking reads over a pair of 42GB FASTQ files (~1.5 billion lines)
# awk 3.1.7
# 1hr 1m
awk '$1 ~ /@/ {++c} END {print c}' $1
# 1hr 1m
awk '/^@/ {c++} END {print c}' $1
# 55m
# awk strings are 1-indexed; a start of 0 behaves differently across awk implementations
awk '{if (substr($0,1,1) == "@") { ++c }} END {print c}' $1
# 9m 58s
LC_ALL=C awk '/^@/ {++c} END {print c}' $1
# python 2.6.6
# 15m
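import sys
count = 0
fastq_1_fh = open(sys.argv[1])  # path to the FASTQ file
line = fastq_1_fh.readline()
while line:
    if line[0] == '@':
        count += 1
    line = fastq_1_fh.readline()
fastq_1_fh.close()
print(count)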

@iiSeymour iiSeymour commented Oct 19, 2014

Anchoring the regular expression will definitely make a difference, and you can drop the explicit field comparison:

$ awk '/^@/ {++c} END {print c}' $1

I'd like to see a benchmark with the latest version of gawk (4.1.0), and it's worth trying mawk too: it's not as fully featured as gawk but is usually significantly faster.

If you're not taking advantage of the fact awk is doing field splitting then maybe just:

$ grep '^@' $1 | wc -l

Or swap out grep for https://github.com/ggreer/the_silver_searcher

Your locale can also play a part in these things. If it's set to UTF-8, rerun with LC_ALL=C.
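The same idea can be tried from the Python side. The following is a minimal sketch, not part of the original benchmark, written for Python 3: it opens the file in binary mode so that no character decoding takes place, which is roughly analogous to what LC_ALL=C does for awk and grep. The function name `count_at_lines` and the tiny sample data are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def count_at_lines(path):
    """Count lines whose first byte is '@', reading the file as raw bytes."""
    count = 0
    # Binary mode: lines come back as bytes, so no text decoding is done.
    with open(path, "rb") as fh:
        for line in fh:  # iterating the file object reads it line by line
            if line.startswith(b"@"):
                count += 1
    return count


# Made-up FASTQ-like sample with two records (illustration only).
sample = b"@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
demo_path = tmp.name

demo_count = count_at_lines(demo_path)
os.remove(demo_path)
print(demo_count)
```

Whether this makes a measurable difference depends on the Python version and the file, so it is worth timing both variants on real data before drawing conclusions.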


@SamStudio8 SamStudio8 commented Oct 19, 2014

Running the anchored version now. grep -c works considerably faster, and I had already used it to get the known count of what I was looking for. I was then interested in whether Python was "slow" at reading in the 780 million lines of the file, or whether it did a reasonable job of keeping up. I guess AWK probably isn't the right thing to benchmark it against (but I thought it might be a relatively nice approximation to C)...

Interesting thought about the locale. What could it be interfering with?
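On the question of whether Python keeps up when reading line by line, one way to get a number is to time the scan inside the script itself. This is a rough sketch under assumptions, not the script used for the timings above: `time_line_scan` is a hypothetical helper, it uses `time.perf_counter` (Python 3.3+), and the in-memory data stands in for a real FASTQ file.

```python
import io
import time


def time_line_scan(fh):
    """Scan a binary file-like object and return (lines, hits, seconds).

    'hits' is the number of lines starting with '@'.
    """
    start = time.perf_counter()
    lines = 0
    hits = 0
    for line in fh:
        lines += 1
        if line[:1] == b"@":  # slice comparison is safe on empty lines
            hits += 1
    elapsed = time.perf_counter() - start
    return lines, hits, elapsed


# Synthetic stand-in for a FASTQ file: 1000 four-line records.
fake_fastq = io.BytesIO(b"@r\nACGT\n+\nIIII\n" * 1000)
lines, hits, elapsed = time_line_scan(fake_fastq)
print("%d lines, %d hits, %.6f s" % (lines, hits, elapsed))
```

For a real run, `fh` would be something like `open(path, "rb")`. Dividing `lines` by `elapsed` gives a lines-per-second figure that can be compared across tools, keeping in mind that disk caching can dominate repeated runs on the same file.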


@SamStudio8 SamStudio8 commented Oct 19, 2014

I'll see if I can get the version of awk on the server updated (mawk doesn't appear to be installed), or whether it can otherwise be made available. The server is a cluster at the university, so it is presumably on some LTS release.


@iiSeymour iiSeymour commented Oct 19, 2014

Why locale matters: http://www.inmotionhosting.com/support/website/ssh/speed-up-grep-searches-with-lc-all

Just build from source and run from ~/bin, or does the job not run on the same node?


@SamStudio8 SamStudio8 commented Oct 19, 2014

@iiSeymour It doesn't run on the same node, and I can't seem to request a particular node either.
Thanks for the link, I feel like that was something I should have already known!
