$ curl http://teralink.net/\~serpent/words.txt > words.txt
$ md5 words.txt
MD5 (words.txt) = 640ef9e082ef75ef07f0e9869e9d8ae2
$ du -h words.txt
217M    words.txt
$ elixir ./count.ex ./words.txt --sort 1> out-sorted.txt
Started
Using 4 Workers
Processing: 4.474982s
Load Results: 0.717677s
Order Results: 1.109028s
Print Results: 2.277602s
Total Runtime: 8.583487s

$ elixir ./count.ex ./words.txt 1> out.txt
Started
Using 4 Workers
Processing: 4.586417s
Load Results: 0.776269s
Print Results: 2.488612s
Total Runtime: 7.855561s

$ elixir ./count-stream.ex < words.txt 1> out.txt
Started
Processing: 4.148846s
Reporting: 0.749119s
Total Runtime: 4.909122s

$ elixir ./count-stream.ex --sort < words.txt 1> out-sorted.txt
Started
Processing: 4.271136s
Reporting: 1.856914s
Total Runtime: 6.137186s
This solution uses multiple workers to perform reads against the same file. It does so by first using File.stat/1 to obtain the size of the file, then generating range tuples (start position and number of bytes to read) that can be farmed out to the workers. Workers are managed with Task.async_stream/3 with no timeout. Furthermore, the concurrency level is set at 50% of all schedulers, which on an HT-enabled machine would, by default, give one worker per physical CPU core.
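The range-generation step can be sketched as follows; `ChunkSketch` and `chunk_ranges/2` are illustrative names for this write-up, not the original module's API:

```elixir
defmodule ChunkSketch do
  # Split `size` bytes into `workers` ranges of {offset, length}.
  # Every worker gets an equal share; the last one picks up the remainder.
  def chunk_ranges(size, workers) do
    base = div(size, workers)

    for i <- 0..(workers - 1) do
      offset = i * base
      length = if i == workers - 1, do: size - offset, else: base
      {offset, length}
    end
  end
end
```

The resulting list of tuples can then be handed to Task.async_stream/3 (with `timeout: :infinity` and `max_concurrency` set to half the online schedulers, per the description above).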
Instead of using a Stream, which essentially processes one character at a time, this solution uses raw file handles in the workers, so the level of concurrency is greater than in a Stream-based solution.
As per the Erlang/OTP documentation on raw mode, this mode is faster because no Erlang process is needed to handle the file. However, it also means that IO.binread/2 must be used to read from the file.
Reading from the file is done in chunks, with the file itself configured with a read-ahead value. This allows the runtime system to prefetch bytes that the code will read later, while avoiding spooling too much data at once within the functional layer of the code.
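A hedged sketch of this setup (the file name, offsets, and buffer size are illustrative, and a tiny sample file is created inline so the snippet is self-contained):

```elixir
# Open a file in :raw mode with a read-ahead buffer, then read one
# {offset, length} range with :file.pread/3. :raw skips the intermediate
# Erlang process; :read_ahead lets the runtime prefetch upcoming bytes.
path = "sample.txt"
File.write!(path, "the quick brown fox jumps over the lazy dog")

{:ok, fd} = File.open(path, [:raw, :binary, {:read_ahead, 64 * 1024}])

# Read 9 bytes starting at byte offset 4.
{:ok, chunk} = :file.pread(fd, 4, 9)
File.close(fd)
```

In the real solution, each worker would issue such positional reads only within its assigned range.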
The problem at hand requires counting words, which essentially requires the program to walk the input stream. This solution uses a simple assumption: any character that is not a newline or a space must be part of a word, so encountering either means we have hit a word boundary. This allows speedy accumulation of in-flight words, one character at a time. Words are accumulated backwards until they need to be reported, at which point the charlist is reversed and the counter in ETS is incremented.
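A minimal sketch of that walk (module name, function names, and the exact ETS calls are assumptions for illustration, not the original code; ASCII input is assumed for simplicity):

```elixir
defmodule WalkSketch do
  # Walk a binary one byte at a time, counting words into an ETS table.
  def count(binary, table) do
    do_count(binary, [], table)
    table
  end

  # Space or newline: we have hit a word boundary, so report and reset.
  defp do_count(<<c, rest::binary>>, acc, table) when c in [?\s, ?\n] do
    report(acc, table)
    do_count(rest, [], table)
  end

  # Any other byte is part of the in-flight word, accumulated backwards.
  defp do_count(<<c, rest::binary>>, acc, table) do
    do_count(rest, [c | acc], table)
  end

  defp do_count(<<>>, acc, table), do: report(acc, table)

  defp report([], _table), do: :ok

  defp report(acc, table) do
    # Reverse the backwards charlist, then bump the ETS counter
    # (inserting {word, 0} first if the word is new).
    word = acc |> Enum.reverse() |> List.to_string()
    :ets.update_counter(table, word, 1, {word, 0})
  end
end
```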
When reading the file in multiple chunks, it is possible that words are split between two chunks, so there is support for capturing the residual prefix and suffix of each chunk. The prefix and suffix are not reported immediately; instead, the top-level function glues them together and reports them to ETS after the workers have finished their jobs.
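The stitching can be illustrated as follows, assuming each worker returns a {prefix, suffix} pair of partial words from the edges of its chunk (`StitchSketch` is a hypothetical helper, not the original API):

```elixir
defmodule StitchSketch do
  # Given the {prefix, suffix} edge pairs in chunk order, recover the
  # words that straddle chunk boundaries: chunk N's trailing fragment
  # glued to chunk N+1's leading fragment.
  def stitch(edges) do
    edges
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [{_p1, suffix}, {prefix, _s2}] -> suffix <> prefix end)
    |> Enum.reject(&(&1 == ""))
  end
end
```

For example, if one chunk ends in "wor" and the next begins with "ld", the top level reports "world" once all workers are done.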
The module attribute @bytes_read_chunk can be tweaked.
Changing the shared ETS table to one table per worker, merged at the end, makes the code more complicated and does not improve performance.
This is a new version created to address Johanna’s comments. The fundamental differences are that:
- It uses IO.binread/2 on STDIN and emits one chunk every @bytes_read_chunk bytes (10 MB by default).
- It uses multiple workers to concurrently split and match words.
- It reuses the prefix/suffix logic to stitch together words at chunk boundaries and mop them up at the top level.
- It uses a compiled pattern (via :binary.compile_pattern/1) to split binaries, which is then shared among the workers.
- It reports the results with a single call to
Additionally, instead of splitting the data into a known number of chunks of unknown size, this version splits the data into an unknown number of chunks of known size. This offers several benefits:
- We can stick with a single Stream that keeps producing chunks.
- We can leverage Task.async_stream/3 to manage the processing of each chunk.
- Many of the binary-splitting functions provided by the :binary module can be used safely (some were slow when given a lot of data).
This greatly simplifies supervision logic.
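Under these assumptions, the pipeline might be sketched like this (chunk size and names are illustrative, and the prefix/suffix stitching at chunk boundaries is omitted for brevity):

```elixir
# Stream fixed-size chunks from a device, split each chunk on a shared
# compiled pattern in concurrent workers, and merge the counts.
count_device = fn device ->
  pattern = :binary.compile_pattern([" ", "\n"])

  device
  |> IO.binstream(10 * 1024 * 1024)   # one chunk every 10 MB
  |> Task.async_stream(
    fn chunk -> :binary.split(chunk, pattern, [:global, :trim_all]) end,
    timeout: :infinity
  )
  |> Enum.reduce(%{}, fn {:ok, words}, acc ->
    Enum.reduce(words, acc, fn word, a -> Map.update(a, word, 1, &(&1 + 1)) end)
  end)
end
```

The same function works against STDIN (`:stdio`) or any other IO device, which also makes it easy to exercise in tests.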
Within this version there are two flags:
- --unicode, which toggles between binary-only and Unicode-aware splitting.
- --sort, which toggles sorting (ordered by count, descending).
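The descending order under --sort could be as simple as the following sketch (sample data invented for illustration; the printing step is not shown):

```elixir
# Order {word, count} pairs by count, descending, before printing.
sorted =
  [{"the", 5}, {"fox", 2}, {"a", 9}]
  |> Enum.sort_by(fn {_word, count} -> count end, :desc)
```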