@NikosAlexandris
Last active September 6, 2021 00:25
Overview of Miller's streaming processing, and memory usage


Miller is streaming when possible (exceptions noted below) -- most verbs:

  • operate on, and store in memory, a single independent record at a time
  • retain no state from one record to the next
  • don't wait for complete ingestion of the input before producing any output
  • (likewise, "statements" (e.g. $z = $x + $y) in the Miller programming language are implicit callbacks executed once per record)
  • operate on files which are larger than the system's memory
  • consume other programs' output via a pipe, e.g. tail -f some-file | mlr --some-flags
  • pipe output to other streaming tools (like cat, grep, sed, etc.)
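The record-at-a-time model can be sketched as a generator pipeline. This is an illustrative Python sketch, not Miller's implementation: each stage holds only the current record in memory and yields its output immediately, in the spirit of mlr cut.

```python
# Illustrative sketch (not Miller itself): a record-at-a-time pipeline.
# Each stage holds only the current record in memory and yields output
# as soon as that record is processed.

def read_records(lines):
    # Parse each DKVP-style line ("a=1,b=2") into a dict, one at a time.
    for line in lines:
        yield dict(pair.split("=", 1) for pair in line.strip().split(","))

def cut(records, fields):
    # Like `mlr cut -f ...`: keep only the named fields of each record.
    for record in records:
        yield {k: v for k, v in record.items() if k in fields}

lines = ["x=1,y=2,z=3", "x=4,y=5,z=6"]
for out in cut(read_records(lines), {"x", "z"}):
    print(out)  # {'x': '1', 'z': '3'} then {'x': '4', 'z': '6'}
```

Because each stage is a generator, the pipeline never materializes the whole input, which is why such verbs can process files larger than memory.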

One disadvantage: streaming sometimes requires accumulating results over records (rows) as they arrive, rather than looping through them explicitly.
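A minimal Python sketch of this accumulation pattern (analogous to mlr put -q '@sum += $x; end { emit @sum }', where state is updated per record and emitted in an end block):

```python
# Sketch of why streaming forces accumulation: to sum a column you
# update state as each record arrives; the result can only be emitted
# once the end of the stream is reached.
def sum_field(records, field):
    total = 0
    for record in records:  # one record at a time, as the stream delivers it
        total += record[field]
    return {field + "_sum": total}  # available only after end of stream

print(sum_field([{"x": 1}, {"x": 2}, {"x": 3}], "x"))  # {'x_sum': 6}
```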

Overview Table

| Streaming | In-memory records | Memory-friendly | Output after end of input | Verbs |
| --- | --- | --- | --- | --- |
| Fully | None | Yes | No | altkv, bar (if not auto-mode), cat, check, clean-whitespace, cut, decimate, fill-down, fill-empty, flatten, format-values, gap, grep, having-fields, head, json-parse, json-stringify, label, merge-fields, nest (if not implode-values-across-records), nothing, regularize, rename, reorder, repeat, reshape (if not long-to-wide), sec2gmt, sec2gmtdate, seqgen, skip-trivial-records, sort-within-records, step, tee, template, unflatten, unsparsify (if invoked with -f) |
| Half | Input files are streamed; the join file (using -f) is loaded into memory at start | | | join |
| No | All | No | Yes | bar (if auto-mode), bootstrap, count-similar, fraction, group-by, group-like, least-frequent, most-frequent, nest (if implode-values-across-records), remove-empty-columns, reshape (if long-to-wide), sample, shuffle, sort, tac, uniq (if mlr uniq -a -c), unsparsify (if invoked without -f) |
| No | Bounded number of records | Yes | Yes | tail, top |
| No | An amount of state, less than all | Variably yes | Yes | count-distinct, count, histogram, stats1 (except mlr stats1 -s for incremental stats before end of stream), stats2, uniq (if not mlr uniq -a -c) |
| Variable; simple operations are fully streaming | Allows for logic to retain all | Yes, except for logic retaining all records | End blocks are executed after end of stream | filter, put |

Table structure reference

Streaming

  • Fully-streaming
  • Half-streaming
  • Variable

In-memory records (state)

For operations requiring deeper retention, Miller retains only as much data as needed. For example, sort and tac must ingest and retain all records in memory before emitting any -- the last input record may well end up being the first one to be emitted.
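A sketch of why tac cannot stream (illustrative Python, not Miller's implementation): the first record to emit is the last one read, so every record must be buffered before any output appears.

```python
# Sketch: `tac` must buffer the whole input before emitting anything,
# because the first output record is the last input record.
def tac(records):
    buffered = list(records)        # memory grows with the input size
    for record in reversed(buffered):
        yield record

print(list(tac(iter([1, 2, 3]))))   # [3, 2, 1]
```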

Other verbs, such as tail and top, need to retain only a fixed number of records -- 10, perhaps, even if the input data has a million records.
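The bounded-retention case can be sketched with a fixed-size buffer (illustrative Python; a deque with maxlen keeps only the last n records no matter how long the stream is):

```python
from collections import deque

# Sketch: `tail`-style verbs need only a bounded buffer. A deque with
# maxlen discards old records as new ones arrive, so memory usage is
# bounded by n even for arbitrarily long streams.
def tail(records, n=10):
    last = deque(records, maxlen=n)  # at most n records ever in memory
    yield from last

print(list(tail(range(100_000), n=3)))  # [99997, 99998, 99999]
```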

Yet other verbs, such as stats1 and stats2, retain only summary arithmetic on the records they visit. These are memory-friendly: memory usage is bounded. However, they only produce output at the end of the record stream.
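A sketch of such bounded summary state (illustrative Python, in the spirit of mlr stats1 -a mean): only a count and a running sum are retained, so memory stays constant regardless of stream length, but the result appears only once the stream ends.

```python
# Sketch: `stats1`-style verbs keep only summary arithmetic (here a
# count and a running sum), so memory is constant no matter how many
# records pass through -- but output is deferred to end of stream.
def mean(records, field):
    count, total = 0, 0.0
    for record in records:  # visit each record once, retain no records
        count += 1
        total += record[field]
    return total / count if count else None

print(mean([{"x": 1.0}, {"x": 3.0}], "x"))  # 2.0
```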

Output

Fully-streaming verbs emit output as records arrive; verbs that retain records or state produce their output only after the end of the input stream.
