Skip to content

Instantly share code, notes, and snippets.

@vergenzt
Last active April 3, 2024 16:16
Show Gist options
  • Save vergenzt/0343786f511bd530ab6f016fc48825c1 to your computer and use it in GitHub Desktop.
Save vergenzt/0343786f511bd530ab6f016fc48825c1 to your computer and use it in GitHub Desktop.
Command line Wald–Wolfowitz runs test https://en.wikipedia.org/wiki/Wald-Wolfowitz_runs_test

If you have a command line that can generate a result, and you want to test whether successive results are statistically indepenent of each other, you can use the following:

seq <N_TESTS> \
| parallel -n0 <COMMAND...> \
| uniq -c \
| awk '{ NR%2 ? n1+=$1 : n2+=$1 } END { print NR, n1, n2 }' \
| python -c 'from statistics import NormalDist as N; import math; n_runs, n1, n2 = map(int, input().split()); n = n1+n2; μ=1.0*(2*n1*n2)/n + 1; σ=math.sqrt(1.0*(μ-1)*(μ-2)/(n-1)); Z = (n_runs-μ)/σ; p = N().cdf(-abs(Z)); p_two_tailed = p*2; print(); print(*list(vars().items())[-9:], sep="\n")'

(To explain: uniq -c counts unique lines within runs -- it only gets you overall unique counts if you sort the data first.)

p_two_tailed in this case represents the "probability of getting the observed results if they are statistically independent of each other". You can only conclude that the results are not independent if the p-value is small e.g. less than 0.05 or 0.01.

Note that you may need to either:

  1. add extra quoting to <COMMAND...> because parallel interprets the args as a shell script by default, or
  2. pass the -q/--quote flag to parallel if your shell command consists of a single program with args

You can also pass additional args (e.g. --progress) to parallel as long as they only output to stderr, not stdout.

Example

$ seq 500 \
> | parallel -n0 echo '$((RANDOM%2))' \
> | uniq -c \
> | awk '{ NR%2 ? n1+=$1 : n2+=$1 } END { print NR, n1, n2 }' \
> | python -c 'from statistics import NormalDist as N; import math; n_runs, n1, n2 = map(int, input().split()); n = n1+n2; μ=1.0*(2*n1*n2)/n + 1; σ=math.sqrt(1.0*(μ-1)*(μ-2)/(n-1)); Z = (n_runs-μ)/σ; p = N().cdf(-abs(Z)); p_two_tailed = p*2; print(); print(*list(vars().items())[-9:], sep="\n")'

('n_runs', 258)
('n1', 242)
('n2', 258)
('n', 500)
('μ', 250.744)
('σ', 11.157671366941461)
('Z', 0.6503149054468893)
('p', 0.25774441546058247)
('p_two_tailed', 0.5154888309211649)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment