@mmparker
Created April 18, 2014 00:03
Ways to use a non-vectorized function on every row of a data.frame
# Sample data
X <- data.frame(
x = c(1, 2, 3),
y = c(4, 5, 6),
etc = c("a", "b", "c")
)
# Arbitrary stand-in for a function that can't be vectorized (pretend pmax() doesn't exist)
max.fun <- function(a, b) { max(c(a, b)) }
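# Why this needs row-by-row treatment (an illustrative aside, not from the
# original gist): handed whole columns, max() collapses them to one value
# instead of returning one value per row. demo.fun is a copy of max.fun so
# that this snippet runs on its own.
demo.fun <- function(a, b) { max(c(a, b)) }
collapsed <- demo.fun(c(1, 2, 3), c(4, 5, 6))
length(collapsed)  # 1, not 3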
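# Using dplyr
# First, tell dplyr how to group rows so that each group is a single row.
# I'm using x and y here, but any column(s) that uniquely identify rows would work.
library(dplyr)
Y <- group_by(X, x, y)
# mutate() then calls max.fun once per group (here, once per row) and keeps every column.
# Note that Y stays grouped afterwards; use ungroup(Y) if later steps shouldn't be grouped.
Y <- mutate(Y, result = max.fun(x, y))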
# Another option: data.table. For a long time, if you wanted the fastest
# possible ops, data.table was it. I'm not sure whether dplyr has surpassed
# it, but for 5 million rows, it's worth a shot.
# Key advantage: := adds the new column by reference, without copying the table.
library(data.table)
Z <- as.data.table(X)
# No need to assign the result: := modifies Z in place, one (x, y) group at a time
Z[, result := max.fun(x, y), by = list(x, y)]
# Check it
Z
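# A third option in base R, sketched here as an addition to the gist rather
# than the author's method: mapply() calls the function once per row, pairing
# up the elements of x and y. base.df and base.fun are illustrative copies of
# X and max.fun so that this snippet runs on its own.
base.df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6), etc = c("a", "b", "c"))
base.fun <- function(a, b) { max(c(a, b)) }
base.df$result <- mapply(base.fun, base.df$x, base.df$y)
# Sanity check against the vectorized answer that pmax() would give
stopifnot(all(base.df$result == pmax(base.df$x, base.df$y)))
base.df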