#! /bin/bash
#
# Driver for the count-matching-lines script
#
# Author: Mike Bland <mbland@acm.org>
#   https://mike-bland.com/, https://github.com/mbland/
# Date: 2016-08-19

usage() {
  local message
  read -r -d '' message <<END_USAGE
Driver for the count-matching-lines script

Usage:
  ${0##*/} <output directory> <pattern> file [file...]

This script will run every combination of flags possible for the
count-matching-lines script based on the number of input files and the version
of bash used to run this script. This means:

- If there is more than one input file, both the one-file-at-a-time and
  concatenated files input methods will run. If there's only one input file,
  there will not be a concatenated files run.
- If the bash version is 4.x and above, the bash 3.x implementation will run,
  followed by all bash 4.x implementations. If the bash version is below 4.x,
  only the bash 3.x implementation will run.

Each run will produce an output file in <output directory>, which will be
created if it doesn't exist, but must be empty if it does. When all runs have
completed, a "summary.txt" file will be generated in <output directory> and
output to the terminal.

If 'memusg' (https://gist.github.com/netj/526585) is in your \$PATH, the results
will contain the peak memory usage for each implementation.
END_USAGE
  echo "$message"
}

do_run() {
  local impl="$1"
  local outfile="$OUTPUT_DIR/"
  local memusg="$(command -v 'memusg')"
  # We need to ensure count-matching-lines runs with whatever version of bash
  # was used to launch this driver rather than relying on $PATH to find bash.
  local run_bash=("$BASH")
  local args=()

  if [[ -n $memusg ]]; then
    run_bash=("$memusg" "$BASH")
  fi

  case "$impl" in
  grep|bashv3|perl|python)
    outfile+="$impl"
    args+=("--$impl")
    ;;
  '')
    impl="bashv${BASH_VERSINFO[0]}"
    outfile+="$impl"
    ;;
  *)
    outfile+="bashv${BASH_VERSINFO[0]}-$impl"
    args+=("--$impl")
    ;;
  esac

  args+=("${ARGV[@]}")
  echo "Running $impl"

  # We have to enable job control because of memusg.
  #
  # This will be easier to understand after playing with a working example.
  # First create the following script; I'll call it "foo.sh":
  #
  #   memusg ls >"foo0.out" 2>&1
  #   memusg ls >"foo1.out" 2>&1
  #
  # Then:
  #
  # - Run it as 'bash foo.sh' and it will hang; run 'fg' to continue.
  # - Run it as 'bash -m foo.sh' and it will complete.
  #
  # Also try editing the output redirections per the following and running it
  # with 'bash foo.sh':
  #
  # - Remove the first "2>&1" redirection and it will hang.
  # - Remove the second "2>&1" redirection and it will complete.
  # - Remove both "2>&1" redirections and it will complete.
  # - Remove the first >"foo0.out" redirection and it will hang.
  # - Remove the second >"foo1.out" redirection and it will complete.
  #
  # This is what's happening to the best of my current knowledge:
  #
  # When it detects that it is running in the same process group as its parent,
  # memusg will try to launch its command arguments in a new interactive shell
  # ('exec bash -i -c "$cmd"', which seems contradictory) to make sure that the
  # command it's measuring is in its own process group. It uses this bash
  # invocation as a more portable alternative to the 'setsid' command, which is
  # available on Linux but not OS X, for example. See:
  #
  #   https://gist.github.com/netj/526585#gistcomment-11943
  #
  # This normally works fine on the command line; it even works fine in a script
  # when the command's standard error is not redirected, or when it's redirected
  # only for the first 'memusg' invocation. In these cases, standard output will
  # redirect to a file just fine and memusg's background process will print
  # "memusg: peak=..." either to the terminal when standard error isn't
  # redirected, or to the file when it is.
  #
  # However, on the second 'memusg' invocation with standard error redirected,
  # this new shell will somehow try to read from the terminal that launched the
  # script. When that happens, the script will receive a SIGTTIN signal, which
  # will push it into the background. (See the "JOB CONTROL" section of the bash
  # man page.) At that point, the user must repeatedly bring the process back
  # into the foreground manually with 'fg' until the final 'memusg' invocation
  # completes.
  #
  # "set -m" enables "Monitor mode", i.e. job control. Normally it only makes
  # sense for interactive shells, but the side effect we need is that all
  # processes are then started in a separate process group. This keeps the 'exec
  # bash -i -c "$cmd"' invocation from happening, which in turn prevents memusg
  # from suspending the script. I validated this by adding the following to
  # memusg before the 'exec bash -i -c "$cmd"' call:
  #
  #   echo "EXECING FOR NEW PROCESS GROUP" >&2
  #
  # Then running the script with:
  #
  # - no job control and no standard error redirection:
  #     message appears, process completes
  # - no job control and with standard error redirection:
  #     message appears, process suspends
  # - with job control and no standard error redirection:
  #     message doesn't appear, process completes
  # - with job control and with standard error redirection:
  #     message doesn't appear, process completes
  #
  # The following reference helped me zero in on the need for "set -m":
  #
  #   https://stackoverflow.com/questions/690266/why-cant-i-use-job-control-in-a-bash-script/690297#690297
  #   (or https://stackoverflow.com/a/690297, but that redirects to http://.)
  #
  # Why this only happens starting with the second 'memusg' invocation with
  # standard error redirected to a file is still a mystery. If I find out, I'll
  # update this comment.
  set -m
  "${run_bash[@]}" "$COUNT_MATCHING_LINES" "${args[@]}" >"$outfile.txt" 2>&1
  if [[ $DO_CAT = 'true' ]]; then
    echo "Running $impl with concatenated files"
    "${run_bash[@]}" "$COUNT_MATCHING_LINES" --cat "${args[@]}" \
      >"$outfile-cat.txt" 2>&1
  fi
  set +m
}

summarize_timing() {
  printf "Timing:\n"
  grep 'real' "$OUTPUT_DIR"/* | sort -k 2 |
    sed -e "s#^$OUTPUT_DIR/##" -e 's/\.txt:real*//' |
    awk '{printf " %s %s\n", $2, $1}'
}

wait_for_memusg() {
  while ps -e -o command | grep -q "[m]emusg $BASH $COUNT_MATCHING_LINES"; do
    sleep 0.1
  done
}

summarize_memusg() {
  wait_for_memusg
  if ! grep -q 'memusg' "$OUTPUT_DIR"/*; then
    return
  fi
  printf "\nMemory usage:\n"
  grep 'memusg' "$OUTPUT_DIR"/* |
    sed -e "s#^$OUTPUT_DIR/##" -e 's/.txt:memusg: peak=/ /' | sort -n -k 2 |
    awk '{printf " %s %s\n", $2, $1}'
}

summarize_file_stats() {
  printf "\nInput file stats:\n"
  egrep "^(Total|Average) " "$OUTPUT_DIR"/grep.txt | awk '{ printf " %s\n", $0 }'
}

summarize() {
  local summary_file="$OUTPUT_DIR/summary.txt"
  summarize_timing > "$summary_file"
  summarize_memusg >> "$summary_file"
  summarize_file_stats >> "$summary_file"
  cat "$summary_file"
  printf "\nAll output saved in: %s\nSummary saved in: %s\n\n" \
    "$OUTPUT_DIR" "$summary_file"
}

count_matching_lines_driver() {
  if [[ $# -eq 0 ]]; then
    usage
    return
  fi

  COUNT_MATCHING_LINES="${0%/*}/count-matching-lines"
  if [[ ! -f $COUNT_MATCHING_LINES ]]; then
    echo "count-matching-lines script should be in the same directory" >&2
    exit 1
  fi

  OUTPUT_DIR="$1"
  shift
  if [[ ! -d $OUTPUT_DIR ]] && ! mkdir "$OUTPUT_DIR"; then
    echo "Could not create output directory: $OUTPUT_DIR" >&2
    exit 1
  elif ! rmdir "$OUTPUT_DIR"; then
    echo "Remove all files from $OUTPUT_DIR and try again." >&2
    exit 1
  fi
  mkdir "$OUTPUT_DIR"

  # If there is more than one filename specified, also run with concatenated
  # input.
  if [[ $# -gt 2 ]]; then
    DO_CAT='true'
  fi
  ARGV=("$@")

  do_run 'grep'
  if [[ ${BASH_VERSINFO[0]} -ge 4 ]]; then
    do_run 'bashv3'
  fi
  do_run ''
  if [[ ${BASH_VERSINFO[0]} -ge 4 ]]; then
    do_run 'mapfile'
    do_run 'callback'
  fi

  local languages=('perl' 'python')
  for language in "${languages[@]}"; do
    if command -v "$language" >/dev/null; then
      do_run "$language"
    fi
  done

  printf "Done\n\n"
  summarize
}

count_matching_lines_driver "$@"
#! /bin/bash
#
# Proof-of-concept bash-only implementation of 'grep -o $exp | sort | uniq -c'
#
# When processing a single, relatively small file (around 200 lines), the
# performance of both the grep and bash-only implementations is fairly close.
# When processing multiple relatively small files, the bash-only version begins
# to pull ahead slightly, as it does not create a new process pipeline for each
# file.
#
# Beyond two hundred lines of input, or when concatenating multiple files into
# standard input, the grep pipeline is significantly faster. The advantage of
# the bash version becomes a potentially smaller memory footprint (since it
# doesn't launch new processes and only one copy of each matching string is
# stored in memory at any time) and portability to platforms that lack grep,
# sort, or uniq (unlikely if bash is installed, but possible).
#
# The bash 4.x version using an associative array significantly outperforms the
# 3.x version, and uses even less memory. Using 'mapfile' in bash 4.x instead of
# 'while read -r' didn't significantly impact the performance on small files. On
# very large files, the 'mapfile' version consumes dramatically more memory,
# since it reads the entire input into an array before processing it.
#
# Author: Mike Bland <mbland@acm.org>
#   https://mike-bland.com/, https://github.com/mbland/
# Date: 2016-08-18

usage() {
  local message
  read -r -d '' message <<END_USAGE
Proof-of-concept bash-only implementation of 'grep -o \$exp | sort | uniq -c'

Usage:
  ${0##*/} [--<flag>] <pattern> file [file...]

Flags:
  --cat       Concatenate all input files, rather than looping over them
  --grep      Use the 'grep -o <pattern> | sort | uniq -c' implementation
  --mapfile   In bash 4.x, replace the 'while read' loop with 'mapfile'
  --callback  In bash 4.x, use a callback with 'mapfile' instead of looping
  --bashv3    Use the bash 3.x implementation, even when running under 4.x
  --perl      Use the perl implementation
  --python    Use the python implementation

The default behavior is to loop over the files one-at-a-time, process them using
the bash-only implementation matching the shell version, then output the
matching lines and their frequency counts for each file.

The first line of output will indicate the implementation used, whether the
files were provided one-at-a-time or concatenated, and the number of lines and
bytes per file.

The process is then executed, the time reported, and the number of files, lines,
and bytes summarized.

NOTE ON BASH 4.X, PERL, AND PYTHON IMPLEMENTATIONS: The output will not
necessarily be sorted by matching string as with the other implementations, as
they use an associative array. The output from these implementations could then
be sorted with 'sort -k 2'.
END_USAGE
  echo "$message"
}

using_grep() {
  grep -o "$1" | sort | uniq -c
}

# Uses the '=~' operator and ${BASH_REMATCH[0]} to replace 'grep -o'.
# Uses an insertion sort to replace 'sort | uniq -c'.
using_bash_v3() {
  local pattern="$1"
  local matches=()
  local counts=()
  local line
  local match
  local i
  local j
  local current

  while read -r line; do
    if [[ ! $line =~ $pattern ]]; then
      continue
    fi
    match="${BASH_REMATCH[0]}"
    for ((i=${#matches[@]}; i != 0; --i)); do
      current="${matches[$((i-1))]}"
      if [[ $match = "$current" ]]; then
        ((--i))
        break
      elif [[ $match > $current ]]; then
        break
      fi
    done

    if [[ ${#matches[@]} -ne 0 && $match < ${matches[$i]} ]]; then
      for ((j=${#matches[@]}; j != i; --j)); do
        matches[$j]="${matches[$((j-1))]}"
        ((counts[$j]=counts[j-1]))
      done
      ((counts[i]=0))
    fi
    matches[$i]="$match"
    ((++counts[i]))
  done

  for ((i=0; i != "${#matches[@]}"; ++i)); do
    printf "%4d %s\n" "${counts[$i]}" "${matches[$i]}"
  done
}

# Uses the '=~' operator and ${BASH_REMATCH[0]} to replace 'grep -o'.
# Uses an associative array to replace 'sort | uniq -c'.
# The output lines aren't sorted by the match strings, but the output can be
# piped into 'sort -k 2' if that matters.
using_bash_v4() {
  local pattern="$1"
  local -A matches=()
  local line

  while read -r line; do
    if [[ $line =~ $pattern ]]; then
      ((++matches["${BASH_REMATCH[0]}"]))
    fi
  done

  for match in "${!matches[@]}"; do
    printf "%4d %s\n" "${matches[$match]}" "$match"
  done
}

# Same as above, but using mapfile instead of 'while read'.
using_bash_v4_with_mapfile() {
  local pattern="$1"
  local -A matches=()
  local line
  local lines=()

  mapfile lines

  for line in "${lines[@]}"; do
    if [[ $line =~ $pattern ]]; then
      ((++matches["${BASH_REMATCH[0]}"]))
    fi
  done

  for match in "${!matches[@]}"; do
    printf "%4d %s\n" "${matches[$match]}" "$match"
  done
}

# Same as above, but using a callback with 'mapfile' instead of looping.
using_bash_v4_with_mapfile_callback() {
  local pattern="$1"
  local -A matches=()

  store_match() {
    if [[ $2 =~ $pattern ]]; then
      ((++matches["${BASH_REMATCH[0]}"]))
    fi
  }
  mapfile -c 1 -C store_match

  for match in "${!matches[@]}"; do
    printf "%4d %s\n" "${matches[$match]}" "$match"
  done
}

using_perl() {
  local pat="${1//\//\\/}"
  perl -e "%matches;while(<>) { if (/$pat/) { \$matches{\$&}++; } }" \
    -e 'while(($k, $v) = each %matches) { printf("%4d %s\n", $v, $k); }'
}

using_python() {
  local program
  read -r -d '' program <<END_PYTHON
import re, sys
p=re.compile('$1')
m={}
for l in sys.stdin:
  s = p.search(l)
  if s:
    g = s.group(0)
    if g not in m:
      m[g]=1
    else:
      m[g]+=1
for k in m:
  print "%4d %s" % (m[k], k)
END_PYTHON
  python -OO -c "$program"
}

one_at_a_time() {
  local impl="$1"
  local pattern="$2"
  shift
  shift

  for f in "$@"; do
    echo "$f:"
    "$impl" "$pattern" < "$f"
  done
}

concatenated() {
  local impl="$1"
  local pattern="$2"
  shift
  shift

  cat "$@" | "$impl" "$pattern"
}

check_is_bash_v4() {
  local flag="$1"
  if [[ "${BASH_VERSINFO[0]}" -lt 4 ]]; then
    printf '%s requires at least bash version 4.x; this is version %s.\n' \
      "$flag" "$BASH_VERSION" >&2
    return 1
  fi
}

count_matching_lines() {
  if [[ $# -eq 0 ]]; then
    usage
    return
  fi

  local impl="using_bash_v${BASH_VERSINFO[0]}"
  if ! command -v "$impl" >/dev/null; then
    printf "This requires bash version 3.x or 4.x; this is version %s.\n" \
      "$BASH_VERSION" >&2
    return 1
  fi

  local provide_files='one_at_a_time'
  local arg
  local pattern
  local num_impl_flags=0

  while [[ $# -ne 0 ]]; do
    arg="$1"
    shift
    case "$arg" in
    --cat)
      provide_files='concatenated'
      ;;
    --grep|--perl|--python)
      impl="using_${arg#--}"
      ((++num_impl_flags))
      ;;
    --mapfile)
      if ! check_is_bash_v4 "$arg"; then
        return 1
      fi
      impl='using_bash_v4_with_mapfile'
      ((++num_impl_flags))
      ;;
    --callback)
      if ! check_is_bash_v4 "$arg"; then
        return 1
      fi
      impl='using_bash_v4_with_mapfile_callback'
      ((++num_impl_flags))
      ;;
    --bashv3)
      impl='using_bash_v3'
      ((++num_impl_flags))
      ;;
    *)
      pattern="$arg"
      break
      ;;
    esac
    if [[ $num_impl_flags -gt 1 ]]; then
      printf "Only one of --grep, --mapfile, --callback, --bashv3, %s\n" \
        "--perl, or --python may be specified." >&2
      return 1
    fi
  done

  local lines
  local bytes
  local total_lines
  local total_bytes
  local file_method="${provide_files//_/ }"
  local impl_name="${impl#using_}"
  local id="${impl_name//_/ } $file_method: $pattern"
  echo "START $id"
  for f in "$@"; do
    while read -r lines bytes; do
      echo " $f ($lines lines, $bytes bytes)"
      ((total_lines+=lines))
      ((total_bytes+=bytes))
    done <<< "$(wc -lc "$f" | awk '{print $1, $2}')"
  done

  printf "\nRUN $id\n\n"
  { time "$provide_files" "$impl" "$pattern" "$@"; } 2>&1
  printf "\nFINISH $id\n\n"

  printf 'Total files: %d\nTotal lines: %d\nTotal bytes: %d\n' \
    "$#" "$total_lines" "$total_bytes"
  printf 'Average lines: %d\nAverage bytes: %d\n\n' \
    "$((total_lines/$#))" "$((total_bytes/$#))"
  echo "END $id"
}

count_matching_lines "$@"
OK, so I'm a bit obsessive... I added the count-matching-lines-driver.sh script to make it easier to compare the performance and memory usage of the different implementations for different input sets, and I added all of the implementations to the original count-matching-lines.sh script. (In the process, I had to debug a particularly subtle memusg issue, extensively documented in a comment within do_run() in the driver script.) I've also added embedded implementations in perl and python.

Hopefully the scripts make it easier to evaluate which implementation makes the most sense for your input characteristics, invocation frequency, and memory constraints.
Here are some results that I put together on my machine:
Environment
- 2.9 GHz Intel Core i5
- 8 GB 1867 MHz DDR3
- bash 4.3.46(1)-release
- bash 3.2.57(1)-release
- Perl v5.24.0
- Python 2.7.12 with -OO
I ran all but the last set of benchmarks using the expression 'local ' to match variable declarations in some bash scripts I'm working on. I concatenated and then copied files to produce a batch of larger files from the original scripts. The results marked -cat are for runs in which all the files were concatenated into a single standard input stream; the other results are for files processed one-at-a-time.
Conclusions
- In most cases, prefer 'grep -o <pattern> | sort | uniq -c'.
- Use perl or python if they're available and the performance of the grep pipeline is choked by memory usage.
- Prefer one of the bash-only implementations when:
  - you're invoking the search on many smaller files at a higher frequency, where the overhead of invoking grep, perl, or python dominates.
  - you're memory-constrained and processing a large number of matches, where the sort execution dominates, and perl or python aren't an option (but make sure you use bash 4.x, or performance may still be unacceptable).
  - you're not guaranteed that tools other than bash are installed (i.e. Windows without MinGW or the Windows 10 Anniversary Update Windows Subsystem for Linux).
- Use bash 4.x instead of 3.x whenever you can get away with it, as the associative array mechanism is much faster than the insertion sort. (OS X and MinGW still ship with 3.x.)
- In bash 4.x:
  - Don't use mapfile if memory usage is a constraint; otherwise it's slightly faster.
  - Don't use mapfile with callbacks (mapfile -c 1 -C store_match).
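The recommendations above can be tried directly in a shell. A minimal sketch of the default pipeline, and of ordering the unsorted output of the associative-array implementations with 'sort -k 2' (the input strings here are made up for illustration):

```shell
# The default implementation: 'grep -o' emits one line per match, then
# 'sort | uniq -c' collapses duplicate matches into counts.
printf 'local foo\nlocal bar\nlocal baz\n' | grep -o 'local' | sort | uniq -c

# The bash 4.x, perl, and python implementations emit unsorted
# "<count> <match>" lines; 'sort -k 2' orders them by the match string.
printf '   2 foo\n   1 bar\n' | sort -k 2
```

The first command prints a single "3 local" count line; the second prints the "bar" line before the "foo" line.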
Caveat
The grep and bashv3 versions have their output sorted by the matching string; the others aren't sorted at all, since they iterate over associative arrays/hashes/dictionaries. If the number of lines matched is large, however, the output from these unsorted implementations may be sorted much less expensively afterward, since grep will output a line for each match that sort and uniq must then process. (I think this explains why, in the "One extremely large file" case, grep was amongst the fastest implementations, but consumed the most memory other than bashv4-mapfile.)
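To make the caveat concrete, here is a small sketch (with made-up input): 'grep -o' hands 'sort' one line per occurrence, while a hash only ever holds one entry per distinct match, so sorting the hash-based output afterward processes far fewer lines. awk stands in for the bash 4.x associative array here:

```shell
# 10,000 occurrences of the same match: 'grep -o' emits 10,000 lines,
# all of which 'sort | uniq -c' must read and compare.
yes 'foo' | head -n 10000 | grep -o 'foo' | wc -l

# A hash-based count reduces the same input to a single
# "<count> <match>" line, so a final 'sort -k 2' is nearly free.
yes 'foo' | head -n 10000 |
  awk '{counts[$0]++} END {for (m in counts) printf "%4d %s\n", counts[m], m}'
```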
One small file
Timing:
0m0.004s bashv4-mapfile
0m0.004s bashv4
0m0.004s perl
0m0.005s grep
0m0.007s bashv4-callback
0m0.008s bashv3
0m0.021s python
Memory usage:
872 bashv3
972 bashv4-mapfile
992 grep
1000 bashv4-callback
1016 perl
1020 bashv4
1060 python
Input file stats:
Total files: 1
Total lines: 162
Total bytes: 4000
Average lines: 162
Average bytes: 4000
A batch of small files
Timing:
0m0.005s perl-cat
0m0.006s grep-cat
0m0.023s bashv4-mapfile
0m0.026s bashv4
0m0.026s python-cat
0m0.033s bashv4-mapfile-cat
0m0.035s bashv4-cat
0m0.041s bashv3
0m0.045s bashv4-callback
0m0.048s grep
0m0.051s bashv4-callback-cat
0m0.056s bashv3-cat
0m0.056s perl
0m0.281s python
Memory usage:
872 bashv3
992 bashv4
992 bashv4-mapfile-cat
992 grep
996 perl-cat
996 python-cat
1000 bashv4-callback
1000 bashv4-callback-cat
1000 bashv4-mapfile
1000 grep-cat
1204 bashv4-cat
2364 bashv3-cat
2464 perl
5460 python
Input file stats:
Total files: 15
Total lines: 1106
Total bytes: 25423
Average lines: 73
Average bytes: 1694
One larger file
Timing:
0m0.004s perl
0m0.005s grep
0m0.020s python
0m0.022s bashv4-mapfile
0m0.025s bashv4
0m0.044s bashv4-callback
0m0.055s bashv3
Memory usage:
980 bashv4-mapfile
992 bashv4-callback
996 perl
1000 bashv3
1036 bashv4
1040 grep
1052 python
Input file stats:
Total files: 1
Total lines: 1106
Total bytes: 25423
Average lines: 1106
Average bytes: 25423
A batch of larger files
Timing:
0m0.008s perl-cat
0m0.018s grep-cat
0m0.027s python-cat
0m0.040s perl
0m0.043s grep
0m0.194s python
0m0.210s bashv4-mapfile
0m0.227s bashv4
0m0.410s bashv4-callback
0m0.432s bashv3
0m0.500s bashv4-mapfile-cat
0m0.538s bashv4-cat
0m0.775s bashv4-callback-cat
0m0.934s bashv3-cat
Memory usage:
900 python-cat
1000 grep-cat
1000 perl
1056 perl-cat
1072 grep
1572 bashv4
1588 bashv3
1760 bashv4-callback
2240 bashv4-mapfile
3140 bashv4-cat
3220 bashv3-cat
3672 bashv4-callback-cat
6004 python
6016 bashv4-mapfile-cat
Input file stats:
Total files: 10
Total lines: 11060
Total bytes: 254230
Average lines: 1106
Average bytes: 25423
One extremely large file
For this case, I filled a single file with 1100011 lines of '#! /bin/bash' and matched against that pattern. This shows that perl, python, and grep win in performance, but bash is more than twice as memory-efficient as perl, almost four times as memory efficient as python, and is thirty-four times as memory-efficient as grep.
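The input for this case can be reconstructed with a one-liner; the byte count matches the reported stats because '#! /bin/bash' plus its newline is 13 bytes, and 13 × 1100011 = 14300143. (The output file name here is a placeholder.)

```shell
# Generate 1100011 copies of '#! /bin/bash' (13 bytes each including the
# trailing newline) into a scratch file.
yes '#! /bin/bash' | head -n 1100011 > huge-input.txt

# Verify the line and byte counts against the reported input file stats.
wc -lc huge-input.txt
```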
Timing:
0m0.491s perl
0m0.863s python
0m10.785s grep
0m31.370s bashv4-mapfile
0m38.058s bashv4
0m53.232s bashv4-callback
1m32.163s bashv3
Memory usage:
1580 bashv4
1588 bashv3
3456 perl
6192 python
53876 bashv4-callback
54212 grep
251788 bashv4-mapfile
Input file stats:
Total files: 1
Total lines: 1100011
Total bytes: 14300143
Average lines: 1100011
Average bytes: 14300143
Oh, and I suppose beyond this, you could write the same thing in a compiled language and ship that with your script if you needed to squeeze every last drop of performance and memory efficiency out of it. But I think I'll stop on that note.
To test my memory usage hypothesis, I created a very large text file with the line '#! /bin/bash' repeated, then ran the script to match this line. While the grep run was ~9x faster (12.5s vs. 1m 54.7s), it used almost 3x as much memory (52.6M vs. 18.3M) as measured by the memusg tool.