@mikmart
Last active March 25, 2019 10:07
Thoughts on what reshaping data is in general

Some thoughts on reshaping data

Key idea: Reshaping is essentially about decoding data from old column names and then encoding other data into new column names. How the values move around, i.e. how the shape of the data changes, is just a byproduct.

Method

Essential steps:

  1. Select columns that are transformed.
  2. Decode variables from selected column names.
  3. Encode variables into new column names.
  4. Create new columns.
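
The four steps can be sketched by hand in base R. This is only an illustration under assumptions: the toy data, the "_" separator and every variable name below are made up for the example, and none of it is part of any existing package.

```r
# Toy data: the column names encode two variables, measure and year.
wide <- data.frame(
  id = c(1, 2),
  height_2018 = c(150, 160),
  height_2019 = c(152, 161),
  weight_2018 = c(50, 60),
  weight_2019 = c(52, 61)
)

# 1. Select columns that are transformed.
sel <- setdiff(names(wide), "id")

# 2. Decode variables from the selected column names.
parts <- strsplit(sel, "_", fixed = TRUE)
measure <- vapply(parts, `[`, character(1), 1)
year <- vapply(parts, `[`, character(1), 2)

# "Molten" intermediate: one row per (id, selected column) pair.
molten <- data.frame(
  id = rep(wide$id, times = length(sel)),
  measure = rep(measure, each = nrow(wide)),
  year = rep(year, each = nrow(wide)),
  value = unlist(wide[sel], use.names = FALSE),
  stringsAsFactors = FALSE
)

# 3. Encode variables into new column names (here just the measure).
molten$new_name <- molten$measure

# 4. Create new columns: one per distinct new name, with rows
#    identified by the variables that were not encoded (id, year).
result <- unique(molten[c("id", "year")])
for (nm in unique(molten$new_name)) {
  rows <- molten$new_name == nm
  idx <- match(
    paste(result$id, result$year),
    paste(molten$id[rows], molten$year[rows])
  )
  result[[nm]] <- molten$value[rows][idx]
}

print(result)
```

Four selected columns go in and two new columns (height, weight) come out, so the data gets longer. Encoding year instead of measure in step 3 would give a different shape from the same molten data.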

How existing functions fit in

gather() allows selecting many columns to transform, but gives no control over how variables are parsed from the column names, or over what or how many new columns are created. The names of the selected columns are simply put into a single new column (the "key"), and their values are flattened into another single column (the "value"). This does not allow reshaping columns that hold different types of data without loss of information.
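
A small base R imitation of that flattening shows where the information goes missing. This is a sketch, not gather() itself, and the toy data is made up: forcing a numeric and a character column into one "value" column coerces everything to a common type.

```r
df <- data.frame(
  id = 1:2,
  age = c(31, 45),
  city = c("Oslo", "Turku"),
  stringsAsFactors = FALSE
)

sel <- c("age", "city")

# Names of the selected columns go into `key`, their values into `value`.
long <- data.frame(
  id = rep(df$id, times = length(sel)),
  key = rep(sel, each = nrow(df)),
  # unlist() has to pick one common type, so the numbers become strings
  value = unlist(df[sel], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
class(df$age)      # the original column is numeric
class(long$value)  # the flattened column is character
```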

In spread() you can only select one column to transform: the "value" column; consequently the decoding step (#2) is skipped, as if it's not there. New column names can also only be taken directly from the values of a single column (the "key" column). Not being able to transform many columns simultaneously leads to workflows that first require a gather() to get all of the desired values into a single column.
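
To make the one-column limitation concrete, here is a base R imitation of spreading a single value column by a single key column. `spread_one()` is a hypothetical helper written for this sketch, not a tidyr function, and the data is made up. Getting both value columns into wide form takes one pass per column plus manual renaming and a merge; with tidyr the usual workaround would instead be a gather() first.

```r
long <- data.frame(
  id = c(1, 1, 2, 2),
  year = c("2018", "2019", "2018", "2019"),
  height = c(150, 152, 160, 161),
  weight = c(50, 52, 60, 61),
  stringsAsFactors = FALSE
)

# Hypothetical helper: spread ONE value column by ONE key column.
spread_one <- function(data, id, key, value) {
  out <- unique(data[id])
  for (k in unique(data[[key]])) {
    rows <- data[[key]] == k
    out[[k]] <- data[[value]][rows][match(out[[id]], data[[id]][rows])]
  }
  out
}

# One pass per value column; new names come straight from the key values.
heights <- spread_one(long, "id", "year", "height")
weights <- spread_one(long, "id", "year", "weight")

# The encoding of "which measure" has to be bolted on by hand afterwards.
names(heights)[-1] <- paste("height", names(heights)[-1], sep = "_")
names(weights)[-1] <- paste("weight", names(weights)[-1], sep = "_")

wide <- merge(heights, weights, by = "id")
print(wide)
```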

A new interface

Arguments

  1. data
  2. selection of columns to transform (tidyselect is a good tool here)
    • maybe also select columns to not transform in 2nd arg
  3. naming the variables to create from selected col names ("names_to")
    • just giving one col ("name") here is equivalent to gather()
  4. determining how to create the variables from col names
    • a separate() "sep" or extract() "pattern"
    • regex with named capture groups allows merging args 2-4
  5. selecting columns that contribute to creating the new columns:
    • can use existing cols in data and variables created from decoding names
    • needs to be an expression (i.e. "value" and value are different)
    • gather() uses no variables, just a static new name ("value")
    • spread() uses a single column, and takes its unique values
  6. specifying how to create new columns from given cols in #5
    • either paste() with a "sep", or glue()/sprintf() with a "template"
    • with a morph() could combine args 5 and 6
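
The decoding and encoding arguments can be tried out in isolation with base R. In this sketch the column names, the pattern and the template are all assumptions made up for the example. A regex with named capture groups does the work of args 2-4 at once: columns that match are selected, the group names name the variables, and the groups extract their values. A paste() "sep" or a sprintf() "template" then builds new names.

```r
cols <- c("id", "height_2018", "height_2019", "weight_2018")

# Named capture groups: group names become the decoded variable names.
pattern <- "^(?<measure>[a-z]+)_(?<year>[0-9]{4})$"
m <- regexpr(pattern, cols, perl = TRUE)

# Columns that match the pattern are the selected ones (arg 2).
selected <- as.vector(m) != -1

starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")
group_names <- attr(m, "capture.names")

# Decode: one variable per capture group (args 3 and 4).
decoded <- data.frame(column = cols[selected], stringsAsFactors = FALSE)
for (j in seq_along(group_names)) {
  decoded[[group_names[j]]] <- substring(
    cols[selected],
    starts[selected, j],
    starts[selected, j] + lens[selected, j] - 1
  )
}

# Encode: build new names with a "sep" or with a "template" (arg 6).
by_sep <- paste(decoded$year, decoded$measure, sep = "_")
by_template <- sprintf("%s (%s)", decoded$measure, decoded$year)

print(decoded)
print(by_sep)
print(by_template)
```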

This transformation does not need a notion of "direction". The shape of the new data simply depends on whether more or fewer new columns are created from the "molten" data than were selected for transformation.

Signature

recast(
    data,
    cols = everything(),
    keys = NULL, # exclude these from `cols`
    # decoding old cols
    parse_to = "name",
    parse_sep = character(),
    parse_pattern = character(),
    # encoding new cols
    build_from = "value",
    build_sep = "_",
    build_glue = character(), # for future with `morph()`
    build_format = character(),
    # enforcing types
    col_types = list()
)

# alternative naming scheme?
recast(
    data,
    cols = everything(),
    keys = NULL, # exclude these from `cols`
    # decoding old cols ("selected")
    sel_to = "name",
    sel_sep = character(),
    sel_pattern = character(),
    # encoding new cols
    new_from = "value",
    new_sep = "_",
    new_glue = character(), # for future with `morph()`
    new_format = character(),
    # enforcing types
    col_types = list()
)