Skip to content

Instantly share code, notes, and snippets.

@rossmounce
Created August 12, 2013 16:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rossmounce/6212835 to your computer and use it in GitHub Desktop.
Save rossmounce/6212835 to your computer and use it in GitHub Desktop.
A _really_ basic script for doing many-to-many tree2tree distance (RF) comparisons in R, using the phangorn package and the function treedist. I should probably use one of the 'apply' functions here, right?
library(phangorn)
#264 REFERENCE trees in phylip format, PAUP numbering hence 2
ref2 <- read.tree("jackr2.tre")
#264 trees in phylip format to pair-wise compare to the reference trees, TNT numbering hence 1
tr2 <- read.tree("jack1.tre")
x <- {}
#all reference trees to one comp tree
for (i in 1:length(tr2)) {
for (j in 1:length(ref2)) {
x <- rbind(x,treedist(tr2[[i]],ref2[[j]]))
}
}
object.size(x) #8.5Mb table at the end :)
@gavinsimpson
Copy link

Never, NEVER, grow an object in an R for() loop. R has to allocate new storage for x and copy across the original x plus the new row. You waste so much time just growing and copying objects this way.

I'm not familiar with what treedist() returns, but you know how many rows x should have. Lets assume treedist() returns a numeric of length 1 (the distance between trees tri and try. First allocate storage for x

x <- matrix(ncol = 3, nrow = prod(length(tr2), length(ref2)))

Now loop

ii <- 1 ## loop counter
for (i in seq_along(tr2)) {
    for (j in seq_along(ref2)) {
        x[ii, ] <- c(i, j, treedist(tr2[[i]],ref2[[j]]))
        ii <- ii+1
    }
}

I'm being lazy and using a separate loop counter; you could probably work out how to fill x just from i and j alone.

Hopefully you get the gist of this? Allocate storage and then fill in.

@rossmounce
Copy link
Author

ah... that sounds logical. Loop counter, great idea! I just going to say I may not always know how many trees are in each treeset file, but this would appear to solve that problem nicely. Thanks!

@ethanwhite
Copy link

Line 4 should be indexing using ii right?

x[ii, ] <- c(i, j, treedist(tr2[[i]],ref2[[j]]))

I actually like the separate counter since it's going to be clearer and have a lower likelihood of bugs than a calculation based on i i and j.

@ethanwhite
Copy link

To be clear, I meant Line 4 in @gavinsimpson's very nice pre-allocation example.

@gavinsimpson
Copy link

@ethanwhite +1 on both points!. I was just hedging my bets in case someone commented here how easy it was to use i and j for the loop counter :-)

@gavinsimpson
Copy link

@ethanwhite @rossmounce Typo now fixed. Thanks.

@karthik
Copy link

karthik commented Aug 12, 2013

Looks like you don't even need to parallelize if this works fine.

There are many ways to parallelize in R but one quick and dirty way would be to split tr2 into a list, then mclapply it (a multicore lapply). If you spawn an instance on Amazon EC2 (there are many public AMIs with R installed; you'll just need to install phangorn and dependencies). Then just rewrite the code above into a function and pass the tr2 list and ref2. It will automatically split across available cores and throw the results back into a list. You can then rbind it easily (even locally).

hope this helps.

@rossmounce
Copy link
Author

thx @karthikram ... think I might wait til after I hand in my thesis til I try that. But since I picked up a free $100 Amazon EC2 gift card at the last #hack4ac I think I'll definitely be trying this out at some point. Thoroughly needed to scale-up my meta-analyses!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment