@shaunagm
Last active August 29, 2015 14:09
wtf r?

data.frame(data.frame(numbers=c(1,2,3)), data.frame(letters=c("a", "b", "c"))) makes a data frame:

  numbers letters
1       1       a
2       2       b
3       3       c

data.frame(data.frame(numbers=c(1,2,3)), data.frame(letters=c("a", "b", "c", "d"))) does not:

Error in data.frame(data.frame(numbers = c(1, 2, 3)), data.frame(letters = c("a",  : 
  arguments imply differing number of rows: 3, 4

data.frame(data.frame(numbers=c(1,2,3)), data.frame(letters=c("a", "b", "c", "d", "e","f"))) makes a data frame, but not the one I'd expect:

  numbers letters
1       1       a
2       2       b
3       3       c
4       1       d
5       2       e
6       3       f

What's the reasoning behind this? I get that if the length of one column is a factor of the length of the other (numerically speaking), R will just repeat the shorter column's content to make it fit. But I'm not sure why you'd want that as a default, with no error or warning, or what about the structure of data frames or R made this seem like a good idea.
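
To make the rule behind the three results above concrete, here is a minimal Python sketch of it. It only models the behavior observed in the examples (repeat a shorter column when its length divides the longest one, otherwise fail). `recycle_columns` is a hypothetical helper written for illustration, not R's actual implementation.

```python
def recycle_columns(columns):
    """Model the recycling rule seen above for a dict of name -> list.

    Hypothetical helper for illustration; not how R implements data.frame().
    """
    n_rows = max(len(values) for values in columns.values())
    recycled = {}
    for name, values in columns.items():
        # A column is only repeated if it fits a whole number of times.
        if len(values) == 0 or n_rows % len(values) != 0:
            lengths = ", ".join(str(len(v)) for v in columns.values())
            raise ValueError(
                "arguments imply differing number of rows: " + lengths
            )
        recycled[name] = list(values) * (n_rows // len(values))
    return recycled


# Same lengths: nothing to recycle.
three = recycle_columns({"numbers": [1, 2, 3], "letters": ["a", "b", "c"]})

# 3 divides 6: the shorter column is silently repeated.
six = recycle_columns({"numbers": [1, 2, 3], "letters": list("abcdef")})
print(six["numbers"])  # [1, 2, 3, 1, 2, 3]

# 3 does not divide 4: this is the only case that fails.
try:
    recycle_columns({"numbers": [1, 2, 3], "letters": list("abcd")})
except ValueError as err:
    print(err)  # arguments imply differing number of rows: 3, 4
```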

@shaunagm
Author

Thanks, Ben. Someone on #R also tipped me off to the "recycling" term.

The lack of a warning for when recycling is done evenly seems to rely on the assumption that if that happens, it must be intentional. That seems unwarranted to me.

I wrote a quick (and possibly buggy) piece of code to see how often a number is evenly divisible by a smaller number:

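import numpy as np

# For each number i, record the fraction of smaller numbers j that divide it evenly.
percent = []
for i in range(2, 1000):
    mod_count = 0.0
    total_count = 0.0
    for j in range(1, i):  # only consider numbers smaller than i
        if i % j == 0:
            mod_count += 1.0
        total_count += 1.0
    percent.append(mod_count / total_count)

# Average the per-number fractions.
print(np.mean(percent))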

Which gave me a result of 0.02632785643547798, which I interpret as, "For a random number between 2 and 999, there is about a 2.6% chance that a randomly chosen smaller number will go into it evenly."
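
One caveat with that count is that it treats 1 as a divisor, and recycling a single value is usually intentional. A rough variant that can leave 1 out is sketched below. The function name `mean_divisor_fraction` and its `smallest_j` parameter are made up for this example, and it uses plain Python instead of numpy.

```python
def mean_divisor_fraction(max_n, smallest_j=1):
    """Average, over i < max_n, of the share of j in [smallest_j, i) dividing i.

    Hypothetical helper; smallest_j=1 repeats the calculation above,
    smallest_j=2 ignores the length-1 case.
    """
    fractions = []
    for i in range(smallest_j + 1, max_n):
        candidates = range(smallest_j, i)
        hits = sum(1 for j in candidates if i % j == 0)
        fractions.append(hits / len(candidates))
    return sum(fractions) / len(fractions)


with_one = mean_divisor_fraction(1000)
without_one = mean_divisor_fraction(1000, smallest_j=2)
print(with_one, without_one)
```

Leaving out 1 lowers the figure, so how "conservative" the estimate is depends on whether single-value recycling counts as an accident.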

I think this is a conservative estimate of how often someone might accidentally recycle silently, because I think people are more likely to be working with multiples when working with real data. If I want to divide my 100 treatment+control measure A by my 100 treatment+control measure B but accidentally divide by my 50 treatment-only measure B, I'm not going to notice that I've done anything wrong.
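
Here is a small Python sketch of that 100-versus-50 mistake. The data is made up, and both `recycled_divide` and `strict_divide` are hypothetical helpers: the first models silent recycling when the lengths divide evenly, the second is one possible guard that only accepts equal lengths or a single value.

```python
def recycled_divide(numerator, denominator):
    """Element-wise division that repeats a shorter denominator.

    Hypothetical model of silent recycling; uneven lengths are rejected here.
    """
    if len(denominator) == 0 or len(numerator) % len(denominator) != 0:
        raise ValueError("lengths do not divide evenly")
    return [
        x / denominator[k % len(denominator)]
        for k, x in enumerate(numerator)
    ]


def strict_divide(numerator, denominator):
    """Only allow equal lengths or a single value (assumed stricter rule)."""
    if len(denominator) not in (1, len(numerator)):
        raise ValueError(
            "length mismatch: %d vs %d" % (len(numerator), len(denominator))
        )
    return recycled_divide(numerator, denominator)


# Made-up data: 100 values of measure A, but only 50 values of measure B.
measure_a = [float(k + 1) for k in range(100)]
measure_b_treatment_only = [2.0] * 50

# Runs without complaint and returns 100 values.
result = recycled_divide(measure_a, measure_b_treatment_only)
print(len(result))  # 100

# The stricter rule surfaces the mistake.
try:
    strict_divide(measure_a, measure_b_treatment_only)
except ValueError as err:
    print(err)  # length mismatch: 100 vs 50
```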

Now I'm tempted to search through published R code and see if there are any recycling errors.
