Imitating plyr and reshape in Julia
# A top priority for making DataFrames useful in Julia is the development of
# good documentation and a nice API for doing plyr+reshape style operations
# in Julia. This Gist is a draft of such documentation.
load("DataFrames")
using DataFrames
load("RDatasets")
baseball = RDatasets.data("plyr", "baseball")
baberuth = subset(baseball, :(id == "ruthba01"))
baberuth = within!(baberuth, :(cyear = year - min(year) + 1))
by(baseball, "id", df -> within(df, :(cyear = year - min(year) + 1)))
by(baseball, "id", df -> within!(df, :(cyear = year - min(year) + 1)))
baseball = subset(baseball, :(ab .>= 25))
#
# Still needs to be implemented
#
#xlim = range(baseball["cyear"], na.rm = TRUE)
#ylim = range(baseball["rbi"] ./ baseball["ab"], na.rm = TRUE)
#
# Translations needed
#
# R> model <- function(df) {lm(rbi / ab ~ cyear, data = df)}
# R> model(baberuth)
# R> bmodels <- dlply(baseball, .(id), model)
# R> rsq <- function(x) summary(x)$r.squared
# R> bcoefs <- ldply(bmodels, function(x) c(coef(x), rsquare = rsq(x)))
# R> names(bcoefs)[2:3] <- c("intercept", "slope")
# R> baseballcoef <- merge(baseball, bcoefs, by = "id")
# R> subset(baseballcoef, rsquare > 0.999)$id
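The R lines still awaiting translation fit one least-squares line per player, collect intercept, slope, and R² per id, and then pick out the ids with a near-perfect fit. As a rough sketch of that computation only, here is a version in plain current Julia that does not use the DataFrames API at all. The function `fit_line`, the `players` dictionary, and all of the numbers are hypothetical stand-ins, not part of the baseball data or of any library.

```julia
# Plain-Julia sketch (no DataFrames): simple least-squares fit per group.
# `fit_line`, `players`, and the toy numbers are hypothetical illustrations.

# Fit y ≈ intercept + slope * x and report R².
function fit_line(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((x .- xbar) .^ 2)              # zero if all x are equal (singular fit)
    sxy = sum((x .- xbar) .* (y .- ybar))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    residuals = y .- (intercept .+ slope .* x)
    ss_res = sum(residuals .^ 2)
    ss_tot = sum((y .- ybar) .^ 2)
    rsquare = 1 - ss_res / ss_tot
    return (intercept = intercept, slope = slope, rsquare = rsquare)
end

# Toy stand-in for per-player (cyear, rbi / ab) observations.
players = Dict(
    "a" => ([1.0, 2.0, 3.0], [0.10, 0.20, 0.30]),            # exactly linear
    "b" => ([1.0, 2.0, 3.0, 4.0], [0.10, 0.30, 0.20, 0.40]),  # noisy
)

# Rough analogue of dlply + ldply: one set of coefficients per id.
bcoefs = Dict(id => fit_line(x, y) for (id, (x, y)) in players)

# Rough analogue of subset(baseballcoef, rsquare > 0.999)$id
near_perfect = sort([id for (id, c) in bcoefs if c.rsquare > 0.999])
```

A player with a single season, or with every observation in the same `cyear`, makes `sxx` zero, so a real translation would need to guard against that case before dividing.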

tshort commented Dec 3, 2012

John, a wiki on DataFrames.jl might be better if you want the public to be able to edit. Anyway, here are some tweaks:

baberuth = subset(baseball, :(id == "ruthba01"))
# alternate way to do this:
baberuth = baseball[:(id == "ruthba01")]

baberuth = within!(baberuth, :(cyear = year - min(year) + 1))

# The following still leaves things grouped
bb1 = by(baseball, "id", df -> within(df, :(cyear = year - min(year) + 1)))
bb2 = by(baseball, "id", df -> within!(df, :(cyear = year - min(year) + 1)))
# Alternate way to do this (and faster); more like R's ave:
baseball["cyear"] = by(baseball, "id", :(cyear = year - min(nafilter(year)) + 1))["cyear"]

baseball = subset(baseball, :(ab .>= 25))
# or
baseball = baseball[:(ab .>= 25)]

range(x) = [min(nafilter(x)), max(nafilter(x))]
xlim = range(baseball["cyear"])
ylim = range(1.0 * baseball["rbi"] ./ baseball["ab"])
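For comparison, the same NA-skipping range helper can be sketched in current Julia, where `missing` plays the role that NA plays above and `skipmissing` stands in for `nafilter`. This is only an illustration on toy vectors: `narange` is a hypothetical name (chosen so that it does not shadow `Base.range`), and the data are made up.

```julia
# Current-Julia sketch: `missing` stands in for NA, `skipmissing` for nafilter.
# `narange` is a hypothetical helper name; the vectors are toy data.
narange(x) = collect(extrema(skipmissing(x)))

cyear = [3, missing, 1, 7]
xlim = narange(cyear)

rbi = [10, 20, missing]
ab  = [100, 50, 40]
# Integer `/` already yields floating point in current Julia,
# so the `1.0 *` coercion is not needed in this sketch.
ylim = narange(rbi ./ ab)
```

`extrema` walks the data once to find both bounds, instead of filtering and scanning twice as the `min`/`max` pair does.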


tshort commented Dec 3, 2012

I can't edit the gist, so here's another tweak showing two ways to do the cyear calculation. The second is the fastest.

within!(groupby(baseball, "id"), :(cyear = year - min(year) + 1))
baseball["cyear"] = by(baseball, "id", :(cyear = year - min(year) + 1))["cyear"]
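To make explicit what the grouped expression computes, here is a minimal sketch of the same "career year" calculation in plain current Julia, without DataFrames. The `ids` and `years` vectors are hypothetical toy stand-ins for the `id` and `year` columns.

```julia
# Plain-Julia sketch of the grouped "career year" calculation.
# `ids` and `years` are hypothetical toy stand-ins for baseball columns.
ids   = ["ruthba01", "ruthba01", "aaronha01", "ruthba01", "aaronha01"]
years = [1914, 1916, 1954, 1915, 1956]

# First pass: earliest year seen for each id.
first_year = Dict{String,Int}()
for (id, y) in zip(ids, years)
    first_year[id] = min(get(first_year, id, y), y)
end

# Second pass: cyear = year - min(year within id) + 1, kept in original row order.
cyear = [y - first_year[id] + 1 for (id, y) in zip(ids, years)]
```

Because the result stays in the original row order, it can be assigned straight back as a new column, which is the behaviour the `ave`-style one-liner relies on.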


tshort commented Dec 3, 2012

Here's another version that mostly does everything. Some parts are still clumsy.

load("DataFrames")
using DataFrames

load("RDatasets")

baseball = RDatasets.data("plyr", "baseball")

baberuth = subset(baseball, :(id == "ruthba01"))
# alternate way to do this:
baberuth = baseball[:(id == "ruthba01")]

baberuth = within!(baberuth, :(cyear = year - min(year) + 1))

# The following still leaves things grouped
## bb1 = by(baseball, "id", df -> within(df, :(cyear = year - min(year) + 1)))
## bb2 = by(baseball, "id", df -> within!(df, :(cyear = year - min(year) + 1)))

## within!(groupby(baseball, "id"), :(cyear = year - min(year) + 1))

baseball["cyear"] = by(baseball, "id", :(cyear = year - min(year) + 1))["cyear"]

# Both of the following filters work, but if they are applied, the lm call below fails with singular systems
## baseball = subset(baseball, :(ab .>= 25))
# or
## baseball = baseball[:(ab .>= 25)]

range(x) = [min(nafilter(x)), max(nafilter(x))]
xlim = range(baseball["cyear"])
ylim = range(1.0 * baseball["rbi"] ./ baseball["ab"])

# R> model <- function(df) {lm(rbi / ab ~ cyear, data = df)}
# R> model(baberuth)

load("DataFrames/demo/lm.jl")
# kludgy data fixups - lm isn't very robust, yet
baseball["rbiperab"] = with(baseball, :(1.0 * rbi ./ ab))
baseball["cyear"] *= 1.0
baberuth["rbiperab"] = with(baberuth, :(1.0 * rbi ./ ab))
baberuth["cyear"] *= 1.0
model(df) = lm(:(rbiperab ~ cyear), df)
model(baberuth)

# R> bmodels <- dlply(baseball, .(id), model)
idx = complete_cases(baseball[["id", "cyear", "rbiperab"]])
bmodels = by(baseball[idx,:], "id", model)  # a Dict

# R> rsq <- function(x) summary(x)$r.squared
# R> bcoefs <- ldply(bmodels, function(x) c(coef(x), rsquare = rsq(x)))

# Let's just resort to a loop to fill in this dataframe:
# Loops aren't bad in Julia (just a little wordy). {We might need a constructor for this.}
bcoefs = similar(DataFrame(:(id= "a"; intercept=0.0; slope=0.0; rsquare=0.0)), length(bmodels))
i = 1
for (k,m) in bmodels
    bcoefs[i,1] = k[1,1]
    bcoefs[i,2] = m.coefficients[1]
    bcoefs[i,3] = m.coefficients[2]
    bcoefs[i,4] = m.r_squared
    i += 1
end

# R> baseballcoef <- merge(baseball, bcoefs, by = "id")
# R> subset(baseballcoef, rsquare > 0.999)$id
baseballcoef = merge(baseball, bcoefs, "id")
baseballcoef[:(rsquare > 0.999), "id"]

johnmyleswhite (author) commented Dec 6, 2012

Thanks for all these comments, Tom! I'm going to take your suggestion and set up a Wiki for this kind of stuff going forward.
