You will quickly notice what many refer to as "GNU bloat": lots of feature-flag spam and references to environment variables that will bite you.
- Rust MIT gets performance from avoiding branch mispredictions and better cache efficiency that BSD and GNU could learn from.
- Rust MIT has parser safety innovations that are game-changing for eliminating whole classes of bugs.
- Rust MIT has memory issues on several utilities.
- Rust MIT has huge binaries due to static linking the standard library.
- Rust MIT hasn't yet exploited many operating system call hints/advice.
- Rust MIT has a very immature test infrastructure.
- If you want to learn Rust, help with coreutils: you will instantly know system calls better than 90% of the "professional" Rust developers out there.
- GNU has filesystem paranoia. What is snake oil and what should be adopted by all?
- Much better benchmarks and sharing of test suites.
- Need at least some formal specification with Alloy/TLA+.
- Be mindful of branch mispredictions. Write branch-free code or recompile with PGO branch hints.
- More shell-level testing. AWK-style expectation tests, not language-specific tests.
- Formal grammars of the input flags - no more YOLO parsing.
- Structured use of operating system advice on file access patterns etc.
- Support for SIMD and multicore.
- System call expectation tests with eBPF, strace/dtrace - no good library for this yet.
- Wrapper around /usr/bin/time that dumps everything, to replace Criterion-style language-specific benchmark runners.
- Test with different shells. Fix shell bugs that hose your terminal. cough bash.
- Write tools so they can work in batch/streaming mode. Every exec() has overhead.
--- Raw notes, if you care how the sausage gets made ---
For Rust to be taken seriously as a systems language it needs to show it can keep up with or exceed the performance of C. UNIX tools provide a playground that is as real world as it gets. In this post we will benchmark implementations of the UNIX wc or "word count" utility.
For UUTILS wc we are building from commit 0bb5ad.
Apple wc is a fork of FreeBSD wc.
The test setup is simple: one 61M file of random bytes, one 61M file of zeros.
#MacBook Pro (Retina, 13-inch, Late 2013)
#2.4 GHz Dual-Core Intel Core i5
#macOS 10.15.7
#GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin19)
#rustc 1.51.0-nightly (8a6518427 2021-01-16)
#cargo 1.51.0-nightly (a73e5b7d5 2021-01-12)
% dd if=/dev/urandom of=testfile bs=1 count=64000000
% /usr/bin/time -l target/release/coreutils wc testfile
249953 1497767 64000000 testfile
0.24 real 0.18 user 0.03 sys
2461696 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
441 page reclaims
209 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
45 voluntary context switches
231 involuntary context switches
% /usr/bin/time -l wc testfile
249953 1937880 64000000 testfile
0.67 real 0.63 user 0.02 sys
1757184 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
452 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
3 voluntary context switches
1710 involuntary context switches
% dd if=/dev/zero of=testfile2 bs=1 count=64000000
% /usr/bin/time -l target/release/coreutils wc testfile2
0 1 64000000 testfile2
0.24 real 0.15 user 0.05 sys
66572288 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
16095 page reclaims
207 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
36 voluntary context switches
178 involuntary context switches
% /usr/bin/time -l wc testfile2
0 1 64000000 testfile2
0.25 real 0.22 user 0.01 sys
1748992 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
448 page reclaims
2 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8 voluntary context switches
136 involuntary context switches
According to StackOverflow, involuntary context switches happen when the code spends over 10ms without a blocking system call. On the random 61M file, BSD wc is really thrashing the CPU. UUTILS wc is twice as fast in the random case while taking 209 more page faults. For the zero case they are neck and neck: UUTILS wc is faster at crunching, BSD wc faster on system calls. However, UUTILS wc blows out the memory budget by an order of magnitude.
# macOS 11.1 aarch64
% /usr/bin/time -l target/release/coreutils wc testfile
249761 1498185 64000000 testfile
0.12 real 0.10 user 0.02 sys
3358720 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
221 page reclaims
14 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8 voluntary context switches
25 involuntary context switches
1084576150 instructions retired
275311834 cycles elapsed
1376960 peak memory footprint
% /usr/bin/time -l wc testfile
249761 1936894 64000000 testfile
0.34 real 0.33 user 0.00 sys
2326528 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
151 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
41 voluntary context switches
0 involuntary context switches
2998822311 instructions retired
992942105 cycles elapsed
1901336 peak memory footprint
BSD wc still beats UUTILS wc on memory. UUTILS still wins on runtime.
% /usr/bin/time -l target/release/coreutils wc testfile2
0 1 64000000 testfile2
0.12 real 0.09 user 0.02 sys
84279296 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
5174 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
17 involuntary context switches
1706548151 instructions retired
281254470 cycles elapsed
82330656 peak memory footprint
% /usr/bin/time -l wc testfile2
0 1 64000000 testfile2
0.13 real 0.11 user 0.00 sys
2326528 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
151 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
20 voluntary context switches
0 involuntary context switches
2263587850 instructions retired
295982804 cycles elapsed
1901336 peak memory footprint
Again on the M1 they are neck and neck for runtime on the zero file, and there is a memory issue in UUTILS wc.
- Why does BSD wc have lower system overhead and use less memory?
- Why is UUTILS wc crushing it on CPU performance in the random case?
- What is causing the memory blowup in UUTILS wc on the zeros file?
- Can we write a Criterion-style script around /usr/bin/time -l to get distributions instead of a single shot?
https://github.com/uutils/coreutils/blob/master/src/uu/wc/src/wc.rs
It has a table of separator characters it uses. Might be some room for making it branch free if LLVM isn't already.
const CR: u8 = b'\r';
const LF: u8 = b'\n';
const SPACE: u8 = b' ';
const TAB: u8 = b'\t';
const SYN: u8 = 0x16_u8;
const FF: u8 = 0x0C_u8;
#[inline(always)]
fn is_word_separator(byte: u8) -> bool {
    byte == SPACE || byte == TAB || byte == CR || byte == SYN || byte == FF
}
Also looks like there are two passes for each line? Have to delve into what each is doing under the hood.
word_count += line.split_whitespace().count();
current_char_count = line.chars().count();
Apple hosts most UNIX tools here. BSD wc isn't in that list; it is hosted here.
The workhorse subroutine. It uses a gotsp variable to create a state machine. There seems to be more branch complexity, and frequent memset() calls to zero out a mbstate_t struct for handling multi-byte character encodings. It is a bit crazy that BSD wc has no test suite that I can find.
- Is UUTILS wc able to properly handle multibyte character encodings?
GNU wc has a lot of paranoia on failed reads. It uses fadvise() to politely inform the Linux kernel that it intends to do sequential reads. The long_lines optimization uses inline code for short lines and function calls for long lines. A magic number of 15 was empirically derived, and the comments note that this number should be recalculated for each architecture; looks like we should do this for the M1 chip.
if (lines - plines <= bytes_read / 15)
  long_lines = true;
else
  long_lines = false;
Instead of zeroing out the mbstate_t each iteration, it keeps a statically zero-initialized one ready to copy onto the stack. Seems like it might be an improvement over BSD wc. Most logic is handled through switch statements for easy compiler branch optimization.
- Is it quicker to statically allocate a zeroed mbstate_t?
- Which wins on branch complexity for the separator matcher? UUTILS byte equality, BSD nested ifs, or GNU switches?
- How much does sequential-read advice help, and can this be done on BSD/XNU?
posix_fadvise() doesn't exist in Apple libc, which is a thin alias for XNU's sys/fcntl.h.
The best XNU IO read/write mode paper I could find. XNU doesn't seem to have OS advice for sequential reads, but Apple makes extensive use of aio_read in their GUI applications. This might be a performance enhancement via IO readahead. This mailing list post discusses using a F_RDAHEAD flag to ask the OS to cache reads. Definitely worth some benchmarking. How XNU handles these is in kern/kern_descrip.c.
Linux fs/fcntl.c uses gobs of headers; I would have to dig through them.
#WSL2 4.19.104-microsoft-standard kernel X86_64
#Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
#clflush size : 64
#cache_alignment : 64
#address sizes : 39 bits physical, 48 bits virtual
#cache size : 6144 KB
#
/usr/bin/time -v target/release/coreutils wc testfile
249577 1498406 64000000 testfile
Command being timed: "target/release/coreutils wc testfile"
User time (seconds): 0.41
System time (seconds): 0.12
Percent of CPU this job got: 29%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.80
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 20
Minor (reclaiming a frame) page faults: 200
Voluntary context switches: 8534
Involuntary context switches: 2
Swaps: 0
File system inputs: 5528
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
/usr/bin/time -v wc testfile
249577 1415624 64000000 testfile
Command being timed: "wc testfile"
User time (seconds): 2.10
System time (seconds): 0.07
Percent of CPU this job got: 52%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1964
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 95
Voluntary context switches: 3922
Involuntary context switches: 2
Swaps: 0
File system inputs: 80
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This time we have GNU wc from the AWSLinux2 distro. Rust crushes it on urandom.
/usr/bin/time -v target/release/coreutils wc testfile2
0 1 64000000 testfile2
Command being timed: "target/release/coreutils wc testfile2"
User time (seconds): 0.33
System time (seconds): 0.08
Percent of CPU this job got: 25%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 66580
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 20
Minor (reclaiming a frame) page faults: 3089
Voluntary context switches: 8502
Involuntary context switches: 2
Swaps: 0
File system inputs: 5272
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
/usr/bin/time -v wc testfile2
0 0 64000000 testfile2
Command being timed: "wc testfile2"
User time (seconds): 1.56
System time (seconds): 0.08
Percent of CPU this job got: 58%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.78
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1856
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 93
Voluntary context switches: 3920
Involuntary context switches: 2
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Once again the Rust version has the memory regression, but still managed to beat GNU wc.
- Did Amazon do a bad job of compiling GNU wc?
- How much speedup can we get from Profile Guided Optimization and Link Time Optimization?
Comparing GNU wc and BSD wc isn't fair. We are going to need all three versions on all three targets.
Also, the word counts between Rust and the other two don't match. There is a specification error or a coding error here. Hypothesis: the read-until-LF call has unbounded memory growth.
StackOverflow says gperftools, Apple Instruments, and Valgrind/Cachegrind can isolate the memory issue.
#Yep, the unbounded read_until seems to be the issue
LD_PRELOAD="/lib64/libtcmalloc.so" HEAPPROFILE=/tmp/profile target/debug/coreutils wc testfile2
pprof --text --alloc_space --line target/debug/coreutils /tmp/profile.0001.heap
#128.0 100.0% uu_wc::wc /mnt/c/Users/chad/github/mit_coreutils/src/uu/wc/src/wc.rs:183
Not only are there tcmalloc and jemalloc; for x86 there is also an Intel malloc.
Valgrind's massif was a bit easier to read:
valgrind --tool=massif target/debug/coreutils wc testfile2
ms_print massif.out.16103
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
21 2,211,129 4,760 4,586 174 0
22 2,275,049 5,720 5,546 174 0
23 2,287,029 7,456 7,274 182 0
24 2,297,142 7,456 7,274 182 0
25 2,320,594 7,416 7,250 166 0
26 2,335,756 7,744 7,558 186 0
27 2,345,636 7,832 7,599 233 0
28 2,352,868 7,808 7,590 218 0
29 2,376,187 7,896 7,665 231 0
30 2,388,027 4,160 3,992 168 0
31 2,395,815 4,224 4,025 199 0
32 2,412,678 4,248 4,035 213 0
33 2,418,439 20,648 20,413 235 0
34 2,430,211 28,840 28,605 235 0
35 2,441,992 45,224 44,989 235 0
36 2,463,733 77,992 77,757 235 0
37 2,505,394 143,528 143,293 235 0
38 2,586,895 274,600 274,365 235 0
99.91% (274,365B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->95.52% (262,294B) 0xA6D39B: alloc::alloc::realloc (alloc.rs:120)
| ->95.52% (262,294B) 0xA6D07B: alloc::alloc::Global::grow_impl (alloc.rs:196)
| ->95.52% (262,294B) 0xA6D562: <alloc::alloc::Global as core::alloc::AllocRef>::grow (alloc.rs:249)
| ->95.52% (262,294B) 0xA6937B: alloc::raw_vec::finish_grow (raw_vec.rs:492)
| ->95.52% (262,294B) 0xA69FF5: alloc::raw_vec::RawVec<T,A>::grow_amortized (raw_vec.rs:428)
| | ->95.52% (262,294B) 0xA69962: alloc::raw_vec::RawVec<T,A>::try_reserve (raw_vec.rs:317)
| | ->95.52% (262,294B) 0xA6A6A1: alloc::raw_vec::RawVec<T,A>::reserve (raw_vec.rs:311)
| | ->95.52% (262,294B) 0xA73197: alloc::vec::Vec<T>::reserve (vec.rs:505)
| | ->95.52% (262,294B) 0xA72D21: alloc::vec::Vec<T>::append_elements (vec.rs:1270)
| | ->95.52% (262,294B) 0xA72BAD: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<&T,core::slice::iter::Iter<T>>>::spec_extend (vec.rs:2373)
| | ->95.52% (262,294B) 0xA72DE1: alloc::vec::Vec<T>::extend_from_slice (vec.rs:1602)
| | ->95.46% (262,144B) 0x2FE78F: std::io::read_until (mod.rs:1724)
| | | ->95.46% (262,144B) 0x2F7A61: std::io::BufRead::read_until (mod.rs:1906)
| | | ->95.46% (262,144B) 0x2F8790: uu_wc::wc (wc.rs:183)
| | | ->95.46% (262,144B) 0x291CBB: uu_wc::uumain (wc.rs:138)
Rust wonks knew about this bug; it has come up several times. Apparently the linereader crate is way faster, which also suggests trying bstr.
Benchmarks game has some examples. Might try the regex?
Rust's simd-json also does an unsafe read of everything into RAM instead of using a max buffer.
Pahole for struct optimization on ELF? Not sure what structs coreutils uses or whether Rust packs structs in a way the tool would pick up on.
cd gnu_coreutils/tests/misc
wc-proc.sh wc-files0-from.pl wc-files0.sh wc.pl wc-nbsp.sh wc-parallel.sh
cp ../coreutils/target/release/wc ./src/wc
How uutils 'wc' is doing UTF-8 Validation
How GNU 'wc' is doing string decoding
How BSD 'wc' is doing string decoding
Overview of Rust vs C/C++ decoding
Right now Rust is doing unbounded buffering, which is bad. GNU reads in 16*1024 = 16384-byte chunks; BSD is using 65536 bytes.
# More efficient way of creating the test files
dd if=/dev/urandom of=test_urandom bs=1024 count=64000
dd if=/dev/zero of=test_zero bs=1024 count=64000
dd if=/dev/zero bs=1024 count=64000 | tr '\000' '1' > test_one
dd if=/dev/zero bs=1024 count=64000 | tr '\000' '\377' > test_binone
dd if=/dev/zero bs=1024 count=64000 | tr '\000' '\n' > test_newline
RUSTFLAGS=-Zsanitizer=address cargo test -Z build-std --target=aarch64-apple-darwin
I tried https://github.com/Freaky/cw/blob/9bd1157310533179fa1a8d1ec53b3363342e55a2/src/count.rs#L35 The file advice on my M1 didn't seem to help, and actually seemed to hurt a small amount. Need to try Linux and Windows.
On the M1:
Stream Type | apple_wc | uu_wc | cw |
---|---|---|---|
urandom | 3.2 | 0.8 | 1.24 |
newline | 0.03 | 0.33 | 0.05 |
binary one | 0.96 | 1.59 | 0.3 |
zeros | 0.93 | 1.18 | 0.66 |
https://github.com/google/autocxx is a library to call C++ from Rust; might be useful in benchmarking.
cargo test -Z timings is cool for benchmarking. Need to hack it to emit memory data also.