Skip to content

Instantly share code, notes, and snippets.

@joyrexus
Created November 5, 2013 03:30
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save joyrexus/7313435 to your computer and use it in GitHub Desktop.
Save joyrexus/7313435 to your computer and use it in GitHub Desktop.
Table munging terminology

Table Munging Terminology

We need a simple vocabulary for talking about the manipulation of tabular data structures. That is, we want a clean and elegant way of describing how to munge tables. We want to slice and dice them with minimal pain. Implementation can follow once we settle on the right verbs for data manipulation operations and the right nouns for post-manipulated elements. Yes, we have relational algebra codified in SQL and its mungy extensions. So we're just talking about a coherent distillation of those parts potentially applicable to any tabular data store.

What follows are extracts from Hadley Wickham's dplyr project README file.

Context

You need to split up a big data structure into homogeneous pieces, apply a function to each piece and then combine all the results back together. For example, you might want to:

  • fit the same model for each subset of a data frame
  • quickly calculate summary statistics for each group
  • perform group-wise transformations like scaling or standardising

Unary verbs

  • select(): focus on a subset of variables
  • filter(): focus on a subset of rows
  • mutate(): add new columns
  • summarise(): reduce each group to a smaller number of summary statistics
  • arrange(): re-order the rows

Binary verbs

As well as verbs that work on a single table, we also want verbs describing how to operate on two tables at a time, viz. your standard joins:

  • inner_join(x, y): matching x + y
  • left_join(x, y): all x + matching y
  • semi_join(x, y): all x with match in y
  • anti_join(x, y): all x without match in y

Other

We might also want to provide head() and print() methods for summary display of our tables.

Existing tools

Further Reading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment