A case for Nushell laziness

Up to this point, Nushell has been built around the idea that commands stream data from one to the next via a pipeline. This pipeline is implemented as an iterator in Rust, allowing it to lazily compute the output as needed. A pipeline can be created, and only the rows needed are pulled through each stage.

This has served Nushell well. It's a simple, yet powerful, abstraction that commands can build on top of.
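
As a rough illustration (the Row type and fields here are made up for the example, not Nushell's actual internals), the current model behaves much like a chain of Rust iterator adapters: each stage pulls rows on demand, but every column of every pulled row has already been computed.

// Illustrative only: the streaming model as a chain of Rust iterator
// adapters. Rows flow one at a time; each stage pulls from the previous.
#[allow(dead_code)]
struct Row {
    name: String,
    size: u64, // carried along for every row, even if no stage reads it
}

fn main() {
    let rows = vec![
        Row { name: "Cargo.toml".into(), size: 512 },
        Row { name: "target/debug".into(), size: 4096 },
    ];

    // Roughly `ls | where name =~ debug | length`:
    let count = rows
        .into_iter()
        .filter(|row| row.name.contains("debug"))
        .count();

    println!("{count}");
}

Note that size is carried for every row even though this pipeline only ever reads name; that is exactly the kind of wasted work described below.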

That said, I think it's time to experiment with a more powerful model. If this proves successful, it could put Nushell in a much better position for both higher performance and a much improved user experience.

Pipelines are too simple

The pipelines we have today allow only one form of laziness: rows are computed as needed. This means we'll calculate columns that may not be needed by later stages of the pipeline, as well as inner tables that later stages may never use. As a result, a command like this:

> ls **/* | where name =~ debug | length

will do a lot of extra work while counting the number of entries in the directory structure that have "debug" in the name.

Let's look at some cases that are also limited by the current architecture:

  • running file stats when they aren't needed (as above): ls **/* | where name =~ debug | length
  • opening a large data file and only using a small part of it: open bigfile.json (note: table will only show the part the user will see. Any inner table will be ignored)
  • round-tripping updates to a data file: open config.toml | inc version --minor | save config.toml. Here we lose the original file because we've completely deserialized the file.
  • the user opens a million line csv file: open million.csv. The only way to break this is to ctrl-c, even though there's really no practical reason to try to show them a million lines (note: for dataframes we already abbreviate the output)
  • we connect to an intelligent data system (database, dataframe, etc) and run nushell commands on it: connect mydb | where name == "bob". Ideally, we would wait and use the data system's native versions of those commands (like compile into sql for a database)

Design

To improve the current design, we'll focus on improving our internal API. Rather than being solely based on lazy streams, data sources will provide a more full-featured API that other commands could call into. For example, say we do this:

> ls | get name

Internally, the ls command would return an LsData provider that exposes a set of methods (via a trait for data providers). One of these methods would be "extract data". The above would be internally translated into setting up the provider and then calling into the "extract data" method, returning the now-configured data provider. Let's look at a more complete example:

> ls | get name | table

We already know the first two steps: we have a data provider set up with an "extract data" step. Finally, we call table, which will run the provider and display the content to the user. The table viewer can do a few things for us, now that we're using this lazy provider approach:

  • ask if a count of items is available
  • ask for the first n elements to display, so it can show them and then abbreviate

Both of the above would be additional configurations/actions on the provider. Once configured, table finally requests the data. At that point, the request is fully configured: we know what data is being extracted and how much of it to extract (for example, the number of items that table wants to display).
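
As a rough sketch of what such a trait might look like (the trait name, methods, and the Value stand-in below are hypothetical, not Nushell's actual API), each configuration step becomes a cheap method call on the provider, and the data is only materialized at the end:

// Hypothetical sketch of a data provider trait; the names are illustrative,
// not Nushell's actual internals.
use std::collections::HashMap;

type Value = String; // stand-in for Nushell's real Value type

trait DataProvider {
    // Configure the provider to produce only the named columns ("extract data").
    fn select_columns(&mut self, columns: &[&str]);
    // Configure an upper bound on how many rows will actually be needed.
    fn limit(&mut self, max_rows: usize);
    // A cheap count, if the source can answer without producing any rows.
    fn count_hint(&self) -> Option<usize>;
    // Run the fully configured query and materialize the rows.
    fn run(&mut self) -> Vec<HashMap<String, Value>>;
}

// `ls | get name | table` would then roughly become:
fn render(provider: &mut dyn DataProvider, page_size: usize) {
    provider.select_columns(&["name"]); // from `get name`
    provider.limit(page_size);          // from `table`'s abbreviation
    for row in provider.run() {
        println!("{}", row["name"]);
    }
}

The design point is that select_columns and limit are cheap configuration calls; only run actually touches the file system or the file contents.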

With this, we only do the work we need to do and no more. The output of the above might look something like:

╭────┬───────────────────────╮
│  0 │ .cargo                │
│  1 │ CODE_OF_CONDUCT.md    │
│  2 │ CONTRIBUTING.md       │
│  3 │ Cargo.lock            │
│  4 │ Cargo.toml            │
│  5 │ LICENSE               │
│  6 │ README.md             │
│  7 │ README.release.txt    │
│  8 │ assets                │
│  9 │ build-all-maclin.sh   │
│ 10 │ build-all-windows.cmd │
│ .. │ ...                   │
╰────┴───────────────────────╯

An additional advantage is that we didn't ask for any file metadata in the process, as ls was configured to not need it.

All of the motivating examples would work in a similar way:

  • running file stats when they aren't needed (as above): ls **/* | where name =~ debug | length

As in the previous example, we're able to configure ls via where and length. From length, it learns that the file metadata may not be needed. If the where can be applied without it, and here it can, then the metadata is never queried. The end result will be faster.
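
To make that concrete, here's a hypothetical sketch of an LsData provider (the type and fields are made up for illustration) that only performs the metadata lookup when a later stage actually asked for those columns:

// Hypothetical LsData provider: the expensive metadata call only happens
// when a later pipeline stage actually asked for metadata columns.
use std::{fs, io};

struct LsData {
    dir: String,
    needs_metadata: bool, // configured by later stages such as `length`
}

impl LsData {
    fn run(&self) -> io::Result<Vec<(String, Option<u64>)>> {
        let mut rows = Vec::new();
        for entry in fs::read_dir(&self.dir)? {
            let entry = entry?;
            let name = entry.file_name().to_string_lossy().into_owned();
            // Skipped entirely when only names (or a count) are needed.
            let size = if self.needs_metadata {
                Some(entry.metadata()?.len())
            } else {
                None
            };
            rows.push((name, size));
        }
        Ok(rows)
    }
}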

  • opening a large data file and only using a small part of it: open bigfile.json (note: table will only show the part the user will see. Any inner table will be ignored)

table would query the JsonFile provider for only the part it needs for displaying (the top-level parts). A smart JSON loader would then be able to skip much of the file and provide only what the table viewer needs, and in a much faster way.
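
One existing technique that illustrates the idea (not necessarily the implementation Nushell would use) is serde_json's RawValue, which parses the top level of a document while leaving nested values as unparsed text:

// Sketch: parse only the top level of a JSON document, leaving nested
// values as raw, unparsed text. Requires serde_json with the "raw_value"
// feature enabled in Cargo.toml.
use std::collections::HashMap;
use serde_json::value::RawValue;

fn top_level_keys(json: &str) -> serde_json::Result<Vec<String>> {
    // Inner objects and arrays are validated but never built into Values.
    let map: HashMap<String, &RawValue> = serde_json::from_str(json)?;
    Ok(map.keys().cloned().collect())
}

fn main() -> serde_json::Result<()> {
    let json = r#"{ "name": "bigfile", "rows": [1, 2, 3] }"#;
    println!("{:?}", top_level_keys(json)?);
    Ok(())
}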

  • round-tripping updates to a data file: open config.toml | inc version --minor | save config.toml. Here we lose the original file because we've completely deserialized the file.

Here we use the TomlFile provider from start to finish, allowing it to be configured both to update a data element and to use more sophisticated handling of the file (eg, here perhaps using the 'toml-edit' crate). This would allow for updating the file without losing the user's formatting, comments, etc.
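
For instance, a format-preserving update with the toml_edit crate might look roughly like this (the function and the hard-coded version bump are illustrative; exact API names vary across toml_edit versions):

// Sketch of a format-preserving round-trip with the toml_edit crate.
// Names match toml_edit 0.19-era releases; newer versions rename Document
// to DocumentMut.
use toml_edit::{value, Document};

fn bump_version(toml_text: &str) -> Result<String, toml_edit::TomlError> {
    let mut doc: Document = toml_text.parse()?;
    // Illustrative only: a real `inc version --minor` would read the current
    // value and increment it rather than hard-coding a new one.
    doc["package"]["version"] = value("0.2.0");
    // Comments, whitespace, and key order from the original file survive.
    Ok(doc.to_string())
}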

  • the user opens a million line csv file: open million.csv. The only way to break this is to ctrl-c, even though there's really no practical reason to try to show them a million lines (note: for dataframes we already abbreviate the output)

table could configure the CsvFile to only yield the parts it needs, allowing the csv loading to be more lazy. This is technically all possible today in Nushell, but would happen as a natural part of the rework without additional logic and hacks.
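
As a sketch of the idea (the helper below is hypothetical and uses the csv crate directly), the provider could stop reading as soon as it has the handful of rows the table viewer asked for:

// Sketch: a CsvFile provider that only materializes the rows `table`
// will actually display, using the csv crate.
use std::error::Error;

fn first_rows(path: &str, limit: usize) -> Result<Vec<csv::StringRecord>, Box<dyn Error>> {
    let mut reader = csv::Reader::from_path(path)?;
    // `records()` is an iterator, so reading stops after `limit` rows;
    // the remaining million lines are never parsed.
    let rows: Result<Vec<_>, _> = reader.records().take(limit).collect();
    Ok(rows?)
}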

  • we connect to an intelligent data system (database, dataframe, etc) and run nushell commands on it: connect mydb | where name == "bob". Ideally, we would wait and use the data system's native versions of those commands (like compile into sql for a database)

Laziness, and relying on the data source to handle configuring the overall data query, means that connecting to a wide array of smart data sources not only becomes possible but also much more efficient than before. We can continue using Nushell commands, knowing that they will be translated into the equivalent commands of the native data source.
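
As a toy illustration of that translation (the Filter type and to_sql helper are made up for this example, and real code would use bound parameters), a Nushell-style filter could be compiled into a WHERE clause instead of being applied row by row on the Nushell side:

// Toy illustration of pushing a Nushell-style filter down to a SQL source:
// `connect mydb | where name == "bob"` becomes a WHERE clause instead of
// filtering rows after they've been fetched.
enum Filter {
    Eq { column: String, value: String },
}

fn to_sql(table: &str, filter: &Filter) -> String {
    match filter {
        Filter::Eq { column, value } => {
            // Real code would use bound parameters, not string interpolation.
            format!("SELECT * FROM {table} WHERE {column} = '{value}'")
        }
    }
}

fn main() {
    let filter = Filter::Eq { column: "name".into(), value: "bob".into() };
    println!("{}", to_sql("users", &filter));
}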

What we're suggesting, in a way, is a light form of query planning for Nushell components as well as a way to defer to the query planning of more sophisticated data sources when the user connects to them.

Tradeoffs

This direction would mean that:

  • Each data source would potentially gain more capabilities so that it can perform better in the new model. While we would likely not require all data sources to do this, there would definitely be an incentive to do it where possible.
  • We'd effectively be supporting two kinds of models: the lazy data provider model and the lazy iterator model. This adds complexity.