You will quickly notice what many refer to as "GNU bloat": lots of feature-flag spam and references to environment variables that will bite you.
- Rust MIT gets performance from avoiding branch mispredictions and better cache efficiency that BSD and GNU could learn from.
- Rust MIT has parser safety innovations that are game-changing for eliminating whole classes of bugs.
- Rust MIT has memory issues on several utilities.
- Rust MIT has huge binaries due to static linking the standard library.
- Rust MIT hasn't yet exploited many operating system call hints/advice.
- Rust MIT has a very immature test infrastructure.
- If you want to learn Rust, help with coreutils: you will instantly know system calls better than 90% of the "professional" Rust developers out there.
- GNU has filesystem paranoia. What is snake oil and what should be adopted by all?
- Much better benchmarks and sharing of test suites.
- Need at least some formal specification with Alloy/TLA+.
- Be mindful of branch mispredictions. Write branch-free code or recompile with PGO branch hints.
- More shell-level testing. AWK-style expectation tests, not language-specific tests.
- Formal grammars of the input flags - no more YOLO parsing.
- Structured use of operating system advice on file access patterns etc.
- Support for SIMD and multicore.
- System call expectation tests with eBPF, strace/dtrace - no good library for this yet.
- Wrapper around /usr/bin/time that dumps everything, to replace Criterion-style language-specific benchmark runners.
- Test with different shells. Fix shell bugs that hose your terminal. cough bash.
- Write tools so they can work in batch/streaming mode. Every exec() has overhead.
--- Raw notes, if you care how the sausage gets made ---
For Rust to be taken seriously as a systems language it needs to show it can keep up with or exceed the performance of C. UNIX tools provide a playground that is as real world as it gets. In this post we will benchmark implementations of the UNIX wc or "word count" utility.
For UUTILS wc we are building from commit 0bb5ad.
Apple wc is a fork of FreeBSD wc.
The test setup is simple: one 61M file of random bytes, one 61M file of zeros.
#MacBook Pro (Retina, 13-inch, Late 2013)
#2.4 GHz Dual-Core Intel Core i5
#macOS 10.15.7
#GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin19)
#rustc 1.51.0-nightly (8a6518427 2021-01-16)
#cargo 1.51.0-nightly (a73e5b7d5 2021-01-12)
% dd if=/dev/urandom of=testfile bs=1 count=64000000
% /usr/bin/time -l target/release/coreutils wc testfile
249953 1497767 64000000 testfile
0.24 real 0.18 user 0.03 sys
2461696 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
441 page reclaims
209 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
45 voluntary context switches
231 involuntary context switches
% /usr/bin/time -l wc testfile
249953 1937880 64000000 testfile
0.67 real 0.63 user 0.02 sys
1757184 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
452 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
3 voluntary context switches
1710 involuntary context switches
% dd if=/dev/zero of=testfile2 bs=1 count=64000000
% /usr/bin/time -l target/release/coreutils wc testfile2
0 1 64000000 testfile2
0.24 real 0.15 user 0.05 sys
66572288 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
16095 page reclaims
207 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
36 voluntary context switches
178 involuntary context switches
% /usr/bin/time -l wc testfile2
0 1 64000000 testfile2
0.25 real 0.22 user 0.01 sys
1748992 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
448 page reclaims
2 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8 voluntary context switches
136 involuntary context switches
According to StackOverflow, involuntary context switches happen when the code spends over 10ms without a blocking system call. On the random 61M file, BSD wc is really thrashing the CPU. UUTILS wc is twice as fast in the random case while taking 209 more page faults. For the zero case they are neck and neck: UUTILS wc is faster at crunching, BSD wc faster on system calls. However, UUTILS wc blows out the memory budget by an order of magnitude.
# macOS 11.1 aarch64
% /usr/bin/time -l target/release/coreutils wc testfile
249761 1498185 64000000 testfile
0.12 real 0.10 user 0.02 sys
3358720 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
221 page reclaims
14 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
8 voluntary context switches
25 involuntary context switches
1084576150 instructions retired
275311834 cycles elapsed
1376960 peak memory footprint
% /usr/bin/time -l wc testfile
249761 1936894 64000000 testfile
0.34 real 0.33 user 0.00 sys
2326528 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
151 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
41 voluntary context switches
0 involuntary context switches
2998822311 instructions retired
992942105 cycles elapsed
1901336 peak memory footprint
BSD wc still beats UUTILS wc on memory. UUTILS still wins on runtime.
% /usr/bin/time -l target/release/coreutils wc testfile2
0 1 64000000 testfile2
0.12 real 0.09 user 0.02 sys
84279296 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
5174 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
0 voluntary context switches
17 involuntary context switches
1706548151 instructions retired
281254470 cycles elapsed
82330656 peak memory footprint
% /usr/bin/time -l wc testfile2
0 1 64000000 testfile2
0.13 real 0.11 user 0.00 sys
2326528 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
151 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
20 voluntary context switches
0 involuntary context switches
2263587850 instructions retired
295982804 cycles elapsed
1901336 peak memory footprint
Again on the M1 they are neck and neck for runtime on the zero file, and there is a memory issue in UUTILS wc.
- Why does BSD wc have lower system overhead and use less memory?
- Why is UUTILS wc crushing it on CPU performance in the random case?
- What is causing the memory blowup in UUTILS wc on the zeros file?
- Can we write a Criterion-style script around /usr/bin/time -l to get distributions instead of a single shot?
https://github.com/uutils/coreutils/blob/master/src/uu/wc/src/wc.rs
It has a table of separator characters it uses. Might be some room for making it branch free if LLVM isn't already.
const CR: u8 = b'\r';
const LF: u8 = b'\n';
const SPACE: u8 = b' ';
const TAB: u8 = b'\t';
const SYN: u8 = 0x16_u8;
const FF: u8 = 0x0C_u8;
#[inline(always)]
fn is_word_separator(byte: u8) -> bool {
    byte == SPACE || byte == TAB || byte == CR || byte == SYN || byte == FF
}
Also looks like there are two passes for each line? Have to delve into what each is doing under the hood.
word_count += line.split_whitespace().count();
current_char_count = line.chars().count();
Apple hosts most UNIX tools here. BSD wc isn't in that list; it is hosted here.
The workhorse subroutine. It uses a gotsp variable to create a state machine. There seems to be more branch complexity, and frequent memset() calls to zero out a mbstate_t struct for handling multi-byte character encodings. It is a bit crazy that BSD wc has no test suite that I can find.
- Is UUTILS wc able to properly handle multibyte character encodings?
GNU wc has a lot of paranoia on failed reads. It uses fadvise() to politely inform the Linux kernel that it intends to do sequential reads. The long_lines optimization uses inline code for short lines and function calls for long lines. A magic number of 15 was empirically derived, and the comments note that this number should be recalculated for each architecture; looks like we should do this for the M1 chip.
if (lines - plines <= bytes_read / 15)
  long_lines = true;
else
  long_lines = false;
Instead of zeroing out the mbstate_t each iteration, it keeps a statically zero-initialized one ready to copy onto the stack. Seems like it might be an improvement over BSD wc. Most logic is handled through switch statements for easy compiler branch optimization.
- Is it quicker to statically allocate a zeroed mbstate_t?
- Which wins on branch complexity for the separator matcher? UUTILS byte equality, BSD nested ifs, or GNU switches?
- How much does sequential-read advice help, and can this be done on BSD/XNU?
posix_fadvise() doesn't exist in Apple libc, which is a thin alias for XNU's sys/fcntl.h.
The best XNU IO read/write mode paper I could find. XNU doesn't seem to have OS advice for sequential reads, but Apple makes extensive use of aio_read in their GUI applications. This might be a performance enhancement via IO readahead. This mailing list post discusses using a F_RDAHEAD flag to ask the OS to cache reads. Definitely worth some benchmarking. How XNU handles these is in kern/kern_descrip.c.
Linux fs/fcntl.c uses gobs of headers; I would have to dig through them.
#WSL2 4.19.104-microsoft-standard kernel X86_64
#Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
#clflush size : 64
#cache_alignment : 64
#address sizes : 39 bits physical, 48 bits virtual
#cache size : 6144 KB
#
/usr/bin/time -v target/release/coreutils wc testfile
249577 1498406 64000000 testfile
Command being timed: "target/release/coreutils wc testfile"
User time (seconds): 0.41
System time (seconds): 0.12
Percent of CPU this job got: 29%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.80
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 20
Minor (reclaiming a frame) page faults: 200
Voluntary context switches: 8534
Involuntary context switches: 2
Swaps: 0
File system inputs: 5528
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
/usr/bin/time -v wc testfile
249577 1415624 64000000 testfile
Command being timed: "wc testfile"
User time (seconds): 2.10
System time (seconds): 0.07
Percent of CPU this job got: 52%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.12
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1964
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 95
Voluntary context switches: 3922
Involuntary context switches: 2
Swaps: 0
File system inputs: 80
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This time we have GNU wc from the AWSLinux2 distro. Rust crushes it on urandom.
/usr/bin/time -v target/release/coreutils wc testfile2
0 1 64000000 testfile2
Command being timed: "target/release/coreutils wc testfile2"
User time (seconds): 0.33
System time (seconds): 0.08
Percent of CPU this job got: 25%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 66580
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 20
Minor (reclaiming a frame) page faults: 3089
Voluntary context switches: 8502
Involuntary context switches: 2
Swaps: 0
File system inputs: 5272
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
/usr/bin/time -v wc testfile2
0 0 64000000 testfile2
Command being timed: "wc testfile2"
User time (seconds): 1.56
System time (seconds): 0.08
Percent of CPU this job got: 58%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.78
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1856
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 93
Voluntary context switches: 3920
Involuntary context switches: 2
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Once again the Rust version has the memory regression, but still managed to beat GNU wc.
- Did Amazon do a bad job of compiling GNU wc?
- How much speedup can we get from Profile Guided Optimization and Link Time Optimization?
Comparing GNU wc and BSD wc isn't fair. We are going to need all three versions on all three targets.
Also, the word counts between Rust and the other two don't match. There is a specification error or a coding error here. Hypothesis: the read-until-LF call has unbounded memory growth.
StackOverflow says gperftools, Apple Instruments, and Valgrind/Cachegrind can isolate the memory issue.
#Yep, the unbounded read_until seems to be the issue
LD_PRELOAD="/lib64/libtcmalloc.so" HEAPPROFILE=/tmp/profile target/debug/coreutils wc testfile2
pprof --text --alloc_space --line target/debug/coreutils /tmp/profile.0001.heap
#128.0 100.0% uu_wc::wc /mnt/c/Users/chad/github/mit_coreutils/src/uu/wc/src/wc.rs:183
Not only are there tcmalloc and jemalloc; for x86 there is also an Intel malloc.
Valgrind's massif was a bit easier to read:
valgrind --tool=massif target/debug/coreutils wc testfile2
ms_print massif.out.16103
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
21 2,211,129 4,760 4,586 174 0
22 2,275,049 5,720 5,546 174 0
23 2,287,029 7,456 7,274 182 0
24 2,297,142 7,456 7,274 182 0
25 2,320,594 7,416 7,250 166 0
26 2,335,756 7,744 7,558 186 0
27 2,345,636 7,832 7,599 233 0
28 2,352,868 7,808 7,590 218 0
29 2,376,187 7,896 7,665 231 0
30 2,388,027 4,160 3,992 168 0
31 2,395,815 4,224 4,025 199 0
32 2,412,678 4,248 4,035 213 0
33 2,418,439 20,648 20,413 235 0
34 2,430,211 28,840 28,605 235 0
35 2,441,992 45,224 44,989 235 0
36 2,463,733 77,992 77,757 235 0
37 2,505,394 143,528 143,293 235 0
38 2,586,895 274,600 274,365 235 0
99.91% (274,365B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->95.52% (262,294B) 0xA6D39B: alloc::alloc::realloc (alloc.rs:120)
| ->95.52% (262,294B) 0xA6D07B: alloc::alloc::Global::grow_impl (alloc.rs:196)
| ->95.52% (262,294B) 0xA6D562: <alloc::alloc::Global as core::alloc::AllocRef>::grow (alloc.rs:249)
| ->95.52% (262,294B) 0xA6937B: alloc::raw_vec::finish_grow (raw_vec.rs:492)
| ->95.52% (262,294B) 0xA69FF5: alloc::raw_vec::RawVec<T,A>::grow_amortized (raw_vec.rs:428)
| | ->95.52% (262,294B) 0xA69962: alloc::raw_vec::RawVec<T,A>::try_reserve (raw_vec.rs:317)
| | ->95.52% (262,294B) 0xA6A6A1: alloc::raw_vec::RawVec<T,A>::reserve (raw_vec.rs:311)
| | ->95.52% (262,294B) 0xA73197: alloc::vec::Vec<T>::reserve (vec.rs:505)
| | ->95.52% (262,294B) 0xA72D21: alloc::vec::Vec<T>::append_elements (vec.rs:1270)
| | ->95.52% (262,294B) 0xA72BAD: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<&T,core::slice::iter::Iter<T>>>::spec_extend (vec.rs:2373)
| | ->95.52% (262,294B) 0xA72DE1: alloc::vec::Vec<T>::extend_from_slice (vec.rs:1602)
| | ->95.46% (262,144B) 0x2FE78F: std::io::read_until (mod.rs:1724)
| | | ->95.46% (262,144B) 0x2F7A61: std::io::BufRead::read_until (mod.rs:1906)
| | | ->95.46% (262,144B) 0x2F8790: uu_wc::wc (wc.rs:183)
| | | ->95.46% (262,144B) 0x291CBB: uu_wc::uumain (wc.rs:138)
Rust wonks knew about this bug; it has come up several times. Apparently the linereader crate is way faster, which also suggests trying bstr.
Benchmarks game has some examples. Might try the regex?
Rust's simd-json also does an unsafe read of everything into RAM instead of using a max buffer.
Pahole for struct optimization on ELF? Not sure what structs coreutils uses or whether Rust packs structs in a way the tool would pick up on.
cd gnu_coreutils/tests/misc
wc-proc.sh wc-files0-from.pl wc-files0.sh wc.pl wc-nbsp.sh wc-parallel.sh
cp ../coreutils/target/release/wc ./src/wc
How uutils 'wc' is doing UTF-8 Validation
How GNU 'wc' is doing string decoding
How BSD 'wc' is doing string decoding
Overview of Rust vs C/C++ decoding
Right now Rust is doing unbounded buffering, which is bad. GNU reads in 16*1024 = 16384-byte chunks; BSD is using 65536 bytes.
# More efficient way of creating the test files
dd if=/dev/urandom of=test_urandom bs=1024 count=64000
dd if=/dev/zero of=test_zero bs=1024 count=64000
dd if=/dev/zero bs=1024 count=64000 | tr '\000' '1' > test_one
dd if=/dev/zero bs=1024 count=64000 | tr '\000' '\377' > test_binone
dd if=/dev/zero bs=1024 count=64000 | tr '\000' '\n' > test_newline
RUSTFLAGS=-Zsanitizer=address cargo test -Z build-std --target=aarch64-apple-darwin
I tried https://github.com/Freaky/cw/blob/9bd1157310533179fa1a8d1ec53b3363342e55a2/src/count.rs#L35 The file advice on my M1 didn't seem to help, and actually seemed to hurt a small amount. Need to try Linux and Windows.
On the M1:
Stream Type | apple_wc | uu_wc | cw |
---|---|---|---|
urandom | 3.2 | 0.8 | 1.24 |
newline | 0.03 | 0.33 | 0.05 |
binary one | 0.96 | 1.59 | 0.3 |
zeros | 0.93 | 1.18 | 0.66 |
https://github.com/google/autocxx is a library to call C++ from Rust; might be useful in benchmarking.
cargo test -Z timings is cool for benchmarking. Need to hack it to emit memory data also.