@shaunagm
Last active August 29, 2015 14:09
wtf r?

data.frame(data.frame(numbers=c(1,2,3)), data.frame(letters=c("a", "b", "c"))) makes a data frame:

  numbers letters
1       1       a
2       2       b
3       3       c

data.frame(data.frame(numbers=c(1,2,3)), data.frame(letters=c("a", "b", "c", "d"))) does not:

Error in data.frame(data.frame(numbers = c(1, 2, 3)), data.frame(letters = c("a",  : 
  arguments imply differing number of rows: 3, 4

data.frame(data.frame(numbers=c(1,2,3)), data.frame(letters=c("a", "b", "c", "d", "e","f"))) makes a data frame, but not the one I'd expect:

  numbers letters
1       1       a
2       2       b
3       3       c
4       1       d
5       2       e
6       3       f

What's the reasoning behind this? I get that if one column's length is a whole multiple of the other's, R will just repeat the shorter column to make it fit. But I'm not sure why you'd want that as a default, without an error, or what about the structure of data frames/R made this seem like a good idea.
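The behaviour in the three examples above can be imitated in a few lines of Python. This is only a sketch of the rule as it appears from those outputs, not R's actual implementation; `recycle_columns` is a made-up helper, and it assumes every column is a non-empty list.

```python
def recycle_columns(columns):
    """Imitate the data.frame() outputs above for a dict of name -> list."""
    n_rows = max(len(values) for values in columns.values())
    for values in columns.values():
        # Assumption: a column is accepted only if its length divides the longest one.
        if len(values) == 0 or n_rows % len(values) != 0:
            lengths = ", ".join(str(len(v)) for v in columns.values())
            raise ValueError(f"arguments imply differing number of rows: {lengths}")
    # Repeat each column as often as needed to reach n_rows.
    return {name: values * (n_rows // len(values)) for name, values in columns.items()}


frame = recycle_columns({"numbers": [1, 2, 3], "letters": ["a", "b", "c", "d", "e", "f"]})
print(frame["numbers"])  # [1, 2, 3, 1, 2, 3]
```

With lengths 3 and 4 the same call raises a `ValueError`, mirroring the error in the second example.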

@benmarwick

Yes, this is R's famous recycling rule... from the canon: "Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector." and "Any short vector operands are extended by recycling their values until they match the size of any other operands." It's not specific to data frames; it applies to anything that operates on vectors, including many function calls.

So it's normal and intended for R, but not everyone agrees it's ideal (cf. this reaction, at 'Vector Operations'). It exists because a vector in R is an ordered set of measurements (which can be useful to repeat) rather than a geometrical position or a physical state, as a vector is defined in other domains.

If R suspects recycling is unintended, i.e. when one length is not an integer multiple of another, then you'll get a warning (data.frame() is stricter and stops with an error, as in your second example). For example, c(1:3)/c(3:6) gives a warning, but c(1,2)/c(3:6) gives no warning. If it seems like you know what you're doing then it will happen silently. Sometimes it's helpful to recycle, sometimes it'll wreck everything...
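For readers who think in Python, here is a rough imitation of that rule for element-wise division. It is a sketch, not R's implementation: `recycled_divide` is a hypothetical name, both inputs are assumed to be non-empty lists, and the warning text is copied from the message R prints in this situation.

```python
import warnings
from itertools import cycle, islice


def recycled_divide(x, y):
    """Divide element-wise, repeating the shorter list R-style."""
    n = max(len(x), len(y))
    # Warn only when the longer length is not a whole multiple of the shorter one.
    if n % len(x) != 0 or n % len(y) != 0:
        warnings.warn("longer object length is not a multiple of shorter object length")
    # cycle() repeats a list endlessly; islice() cuts it off at n elements,
    # which also covers the "fractional" repeat in the non-multiple case.
    xs = islice(cycle(x), n)
    ys = islice(cycle(y), n)
    return [a / b for a, b in zip(xs, ys)]


quiet = recycled_divide([1, 2], [3, 4, 5, 6])     # lengths 2 and 4: silent
noisy = recycled_divide([1, 2, 3], [3, 4, 5, 6])  # lengths 3 and 4: warns
```

Both calls return four values; only the second one says anything about it.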

@shaunagm
Author

Thanks, Ben. Someone on #R also tipped me off to the "recycling" term.

The lack of a warning for when recycling is done evenly seems to rely on the assumption that if that happens, it must be intentional. That seems unwarranted to me.

I wrote a quick (and possibly buggy) piece of code to see how often a number is divisible by another number:

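import numpy as np

# For each i in 2..999, find the fraction of smaller positive integers j
# that divide i evenly (j == 1 is included and always counts).
percent = []
for i in range(2, 1000):
    mod_count = 0.0    # how many j < i divide i evenly
    total_count = 0.0  # how many j < i were checked
    for j in range(1, i):
        if i % j == 0:
            mod_count += 1.0
        total_count += 1.0
    percent.append(mod_count / total_count)

# Average the per-i fractions.
print(np.mean(percent))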

Which gave me a result of 0.02632785643547798, which I interpret as, "For a random number between 2 and 999, there is roughly a 2.6% chance that a randomly chosen smaller number will go into it evenly." (That count includes 1, which goes into everything.)

I think this is a conservative estimate of how often someone might accidentally recycle silently, because I think people are more likely to be working with multiples when working with real data. If I want to divide my 100 treatment+control measure A by my 100 treatment+control measure B but accidentally divide by my 50 treatment-only measure B, I'm not going to notice that I've done anything wrong.
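One way to picture the alternative default is a division that refuses to recycle unless one side is a single value. The sketch below is purely illustrative: `strict_divide` and the made-up measurements are hypothetical, not anything R provides.

```python
def strict_divide(numerators, denominators):
    """Divide element-wise; only a length-1 operand may be repeated."""
    n, m = len(numerators), len(denominators)
    if m == 1:
        denominators = list(denominators) * n
    elif n == 1:
        numerators = list(numerators) * m
    elif n != m:
        # 100 vs 50 lands here instead of being silently recycled.
        raise ValueError(f"length mismatch: {n} vs {m}; refusing to recycle")
    return [a / b for a, b in zip(numerators, denominators)]


measure_a = [float(k) for k in range(1, 101)]  # 100 made-up values
measure_b_treatment_only = [2.0] * 50          # 50 made-up values

try:
    strict_divide(measure_a, measure_b_treatment_only)
except ValueError as err:
    print(err)  # length mismatch: 100 vs 50; refusing to recycle
```

Dividing by a single number still works, so the convenient scalar case is kept while the 100-by-50 mistake becomes loud.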

Now I'm tempted to search through published R code and see if there are any recycling errors.
