@NikosAlexandris
Last active September 6, 2021 00:25
Overview of Miller's streaming processing, and memory usage


Miller is streaming when possible (exceptions noted below) -- most verbs:

  • operate on, and store in memory, a single independent record at a time
  • retain no state from one record to the next
  • don't wait for complete ingestion of the input before producing any output
  • (likewise, "statements" (e.g. $z = $x + $y) in the Miller programming language are implicit callbacks executed once per record)
  • operate on files which are larger than the system's memory
  • consume other programs' output via a pipe, e.g. tail -f some-file | mlr --some-flags
  • pipe output to other streaming tools (like cat, grep, sed, etc.)
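The record-at-a-time model can be sketched as a generator pipeline. This is an illustrative Python sketch, not Miller's implementation: each stage holds only the current record in memory and yields its output immediately, in the spirit of mlr cut.

```python
# Illustrative sketch (not Miller itself): a record-at-a-time pipeline.
# Each stage holds only the current record in memory and yields output
# as soon as that record is processed.

def read_records(lines):
    # Parse each DKVP-style line ("a=1,b=2") into a dict, one at a time.
    for line in lines:
        yield dict(pair.split("=", 1) for pair in line.strip().split(","))

def cut(records, fields):
    # Like `mlr cut -f ...`: keep only the named fields of each record.
    for record in records:
        yield {k: v for k, v in record.items() if k in fields}

lines = ["x=1,y=2,z=3", "x=4,y=5,z=6"]
for out in cut(read_records(lines), {"x", "z"}):
    print(out)  # {'x': '1', 'z': '3'} then {'x': '4', 'z': '6'}
```

Because each stage is a generator, the pipeline never materializes the whole input, which is why such verbs can process files larger than memory.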

One disadvantage: streaming sometimes requires accumulating results over records (rows) as they arrive, rather than looping through them explicitly.
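A minimal Python sketch of this accumulation pattern (analogous to mlr put -q '@sum += $x; end { emit @sum }', where state is updated per record and emitted in an end block):

```python
# Sketch of why streaming forces accumulation: to sum a column you
# update state as each record arrives; the result can only be emitted
# once the end of the stream is reached.
def sum_field(records, field):
    total = 0
    for record in records:  # one record at a time, as the stream delivers it
        total += record[field]
    return {field + "_sum": total}  # available only after end of stream

print(sum_field([{"x": 1}, {"x": 2}, {"x": 3}], "x"))  # {'x_sum': 6}
```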

Overview Table

| Streaming | In-memory records | Memory-friendly | Output after end of input | Verbs |
| --- | --- | --- | --- | --- |
| Fully | None | Yes | No | altkv, bar (if not auto-mode), cat, check, clean-whitespace, cut, decimate, fill-down, fill-empty, flatten, format-values, gap, grep, having-fields, head, json-parse, json-stringify, label, merge-fields, nest (if not implode-values-across-records), nothing, regularize, rename, reorder, repeat, reshape (if not long-to-wide), sec2gmt, sec2gmtdate, seqgen, skip-trivial-records, sort-within-records, step, tee, template, unflatten, unsparsify (if invoked with -f) |
| Half | Input files are streamed; the join file (using -f) is loaded into memory at start | | | join |
| No | All | No | Yes | bar (if auto-mode), bootstrap, count-similar, fraction, group-by, group-like, least-frequent, most-frequent, nest (if implode-values-across-records), remove-empty-columns, reshape (if long-to-wide), sample, shuffle, sort, tac, uniq (if mlr uniq -a -c), unsparsify (if invoked without -f) |
| No | Bounded number of records | Yes | Yes | tail, top |
| No | An amount of state, less than all | Variably yes | Yes | count-distinct, count, histogram, stats1 (except mlr stats1 -s for incremental stats before end of stream), stats2, uniq (if not mlr uniq -a -c) |
| Variable; simple operations are fully streaming | Allows for logic to retain all | Yes, except for logic retaining all records | End blocks are executed after end of stream | filter, put |

Table structure reference

Streaming

  • Fully-streaming
  • Half-streaming
  • Variable

In-memory records (state)

For operations requiring deeper retention, Miller retains only as much data as needed. For example, sort and tac must ingest and retain all records in memory before emitting any -- the last input record may well end up being the first one to be emitted.
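A sketch of why tac cannot stream (illustrative Python, not Miller's implementation): the first record to emit is the last one read, so every record must be buffered before any output appears.

```python
# Sketch: `tac` must buffer the whole input before emitting anything,
# because the first output record is the last input record.
def tac(records):
    buffered = list(records)        # memory grows with the input size
    for record in reversed(buffered):
        yield record

print(list(tac(iter([1, 2, 3]))))   # [3, 2, 1]
```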

Other verbs, such as tail and top, need to retain only a fixed number of records -- 10, perhaps, even if the input data has a million records.
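The bounded-retention case can be sketched with a fixed-size buffer (illustrative Python; a deque with maxlen keeps only the last n records no matter how long the stream is):

```python
from collections import deque

# Sketch: `tail`-style verbs need only a bounded buffer. A deque with
# maxlen discards old records as new ones arrive, so memory usage is
# bounded by n even for arbitrarily long streams.
def tail(records, n=10):
    last = deque(records, maxlen=n)  # at most n records ever in memory
    yield from last

print(list(tail(range(100_000), n=3)))  # [99997, 99998, 99999]
```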

Yet other verbs, such as stats1 and stats2, retain only summary arithmetic on the records they visit. These are memory-friendly: memory usage is bounded. However, they only produce output at the end of the record stream.
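A sketch of such bounded summary state (illustrative Python, in the spirit of mlr stats1 -a mean): only a count and a running sum are retained, so memory stays constant regardless of stream length, but the result appears only once the stream ends.

```python
# Sketch: `stats1`-style verbs keep only summary arithmetic (here a
# count and a running sum), so memory is constant no matter how many
# records pass through -- but output is deferred to end of stream.
def mean(records, field):
    count, total = 0, 0.0
    for record in records:  # visit each record once, retain no records
        count += 1
        total += record[field]
    return total / count if count else None

print(mean([{"x": 1.0}, {"x": 3.0}], "x"))  # 2.0
```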

Output

Fully-streaming verbs emit output as records arrive; verbs that retain records or state produce their output only after the end of the input stream.
