@dwbapst
Last active September 29, 2020 15:29
MEE Blog Post for Bapst 2012 paleotree

paleotree in Methods in Ecology and Evolution: A Retrospective

David Bapst

Texas A&M University

dwbapst@tamu.edu

How I Ended Up Writing About My R Package in Methods in Ecology and Evolution Back In 2012

I was a fourth year graduate student when I first had the idea to make an R package. Quite a few people thought it was a bit silly, or a bit of a time-waste, but I thought it was the right thing to do at the time, and I think it has proven to be the right decision in hindsight. My dissertation was on the approaches available for dating phylogenies of fossil taxa when character data wasn’t available. At the time, it was very common to see papers dating trees by simply taking the age of the oldest appearing taxon in a clade, and assigning that date as the node age for that group. This was usually done because we wanted to do macroevolutionary analyses with phylogenetic comparative methods, which meant we needed dated trees, rather than because of any serious interpretation of the dates implied. Of course, phylogeny and stratigraphic order don’t always match, so when the oldest appearing taxon in a group was also very nested, a whole bunch of nodes would be shoved together, as if they had split simultaneously, in the same instant of time. So, sometimes, users would add a little extra time (say, a million years) to space out the nodes further, which for big trees sometimes moved the root age backwards, tens of millions of years earlier. There were a few other approaches as well, all of which each person had to code themselves, or find in a script some other user had written. Even worse, what people were actually doing was often unclear -- some papers would say they did one thing in the text, but their code showed that they did something else.
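To make the two simple approaches described above concrete, here is a minimal sketch in Python. It is not the code in paleotree or in any published script; the function name `node_ages`, the nested-tuple tree, and the first appearance dates are all hypothetical, chosen only to show what happens when the oldest appearing taxon is deeply nested. Ages are in millions of years before present, so larger numbers are older.

```python
def node_ages(tree, first_appearance, min_branch=0.0):
    """Date a cladogram from the first appearance dates of its tips.

    tree: a tip name (str) or a tuple of subtrees (hypothetical format).
    Returns (age of this subtree's root, internal node ages in postorder).
    With min_branch = 0, each node gets the age of its oldest descendant.
    With min_branch > 0, each node is pushed back so that every branch
    below it is at least min_branch long.
    """
    if isinstance(tree, str):
        # A tip is dated by its own first appearance.
        return first_appearance[tree], []

    child_ages = []
    ages = []
    for child in tree:
        age, below = node_ages(child, first_appearance, min_branch)
        child_ages.append(age)
        ages.extend(below)

    # The node must be at least as old as its oldest child,
    # plus whatever extra time is being added to each branch.
    age = max(child_ages) + min_branch
    ages.append(age)
    return age, ages


# Hypothetical example: the oldest taxon (A) is the most nested one.
tree = (((("A", "B"), "C"), "D"), "E")
first_appearance = {"A": 100.0, "B": 90.0, "C": 80.0, "D": 70.0, "E": 60.0}

basic_root, basic_ages = node_ages(tree, first_appearance)
mbl_root, mbl_ages = node_ages(tree, first_appearance, min_branch=1.0)

print(basic_ages)  # [100.0, 100.0, 100.0, 100.0]
print(mbl_ages)    # [101.0, 102.0, 103.0, 104.0]
```

In the first result all four nodes collapse onto the same instant, and in the second the added million years per branch accumulates toward the root. This tree has only five tips, so the root moves back by four million years; on a large tree with many nested nodes the same accumulation is what shifts the root back by tens of millions of years.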

Unfortunately for me, I really wanted to apply comparative methods to datasets I was going to generate in my dissertation, but the situation of how to date a tree of fossil taxa was a mess, and the excuse of 'well, this is what everyone else does' wore thin. When I was being evaluated for candidacy, my committee made some rather pained faces every time I tried to explain this, so I decided to try to do something about it, and make better methods. Along the way, I implemented the previous methods, and developed functions for simulating fossil data and their incompleteness, and for simulating how ancestor-descendant relationships convert to poorly resolved cladograms, so that I could see what effect the different methods had on our inferences.

All of this seemed like stuff that other people would find useful, since it wasn’t available in an R package, and if it was useful enough to provide it at all, then to me it seemed worthwhile to create my own R package. One of my inspirations for this was Gene Hunt’s paleoTS R package, which implemented the time-series analyses that Gene had developed for testing for stasis and punctuated change in the fossil record. At a conference in 2010, I told Gene I was planning on putting what I’d presented into an R package, and he suggested there were already too many packages, and I should find someone else’s existing package to insert my functions into. I couldn’t explain why at the time, but that didn’t seem right -- it seemed important that I bear responsibility for maintaining my own code.

Of course, writing an R package wasn’t really meant to be a chapter of my dissertation, and even if I thought it was important to write it as a package, some suggested I should wait until after my PhD. So, to make the case for working on a manuscript that wasn’t a chapter, I suggested that I write a short paper accompanying the package, highlighting the use of the package, and that this would really help make me stand out after I graduated. I had seen the software reports describing R packages in Methods in Ecology and Evolution, and so I started making all my most useful code into functions, and then used package.skeleton (devtools had only been around for half a year on CRAN, and I wouldn’t start using devtools myself until 2013). I then discovered that each of my functions suddenly had an .Rd file, waiting for me to fill it with documentation. Not wanting to delay submitting to CRAN before I submitted my short write-up to MEE, I spent a week doing the minimum documentation necessary. At that point, I discovered that programming was hard, but that it was much more time consuming to write detailed documentation that would be understandable to another person.

Hindsight is Always 20/20

Looking over the 2012 paleotree paper in MEE now, it's funny to think about what I decided to showcase.

I have a section on plotting taxonomic diversity over time in a given dataset, something I had spent a lot of work doing, as a way of validating my simulation methods, and yet (to my knowledge) hardly anyone has used it in a publication, not even me. The methods for estimating sampling rates are also highlighted because I thought they would also become widely used (they didn’t, even after I expanded the repertoire of methods available in 2014). I also showcase the simulation methods a fair bit in the 2012 MEE paper, and while I know a few other workers who have used my simulation tools for some interesting uses (e.g. Quental and Marshall, 2015), I think my overall hope that my simulation tools would become widely used didn’t come to fruition. As it is, they had a number of novelties that aren’t really explained in that paper, so I would later need to detail the simulation algorithm in my 2013 PLOS One paper, and explain more details in my 2014 Paleobiology paper. The entire section actually no longer bears any relationship to the current simulation tools in paleotree, as I later realized there was an alternative approach to simulating incomplete fossil records under different conditional scenarios that was quicker and fixed a number of issues with my previous approaches.

(Actually, almost none of the function names, or argument names in Bapst (2012) have stayed the same since 2012. They have become more standardized, and better match best practices, despite my continued adherence to camelCase.)

Most studies which cite my 2012 paleotree paper do so because the authors of those studies used functions in paleotree to date a phylogeny. Many of the uses I am cited for are by people using dating methods I didn’t invent, but merely implemented in an R package. The ‘better method’ for dating trees that I was working on (now known as the cal3 method) is mentioned briefly (as the ‘src’ method, a name I would replace a few months later). My later work (Bapst, 2014, Paleobiology; Bapst & Hopkins, 2016; Bapst et al., 2016; Lloyd et al., 2016) would reveal, through simulation and application to empirical data, that while the previous approaches could produce fairly misleading inferences, my 'improved' method was not as improved as I hoped.

The examples in the MEE paper are all on simulated data, rather than an empirical dataset, even for simple methods such as plotting diversity curves. This, I would realize in a year or two, was making it difficult for new users to figure out how to use paleotree with their data at all. I am not sure how I did not realize, back in 2012, that using only simulations in my examples would be really off-putting to others. I definitely saw my folly when I started receiving a number of emails from confused users, asking how to get my simulation functions to ‘read their data’. Instead of using the MEE paper as a guide to using the package, I more often hear that people looked at a tutorial I posted a few years ago on my abandoned blog, involving a real dataset of retiolitid graptolites. (That blog post was originally going to be a formal vignette, but I never got around to it.)

What Has Happened Since 2012

While I am a little embarrassed by the aged content of the MEE paper from 2012, the truth is, I never really stopped writing that paper. As a static document, that paper introduced users to paleotree in 2012, and although the usefulness of that static document today is questionable, the paper also serves as an anchor for the package, linking it with the academic literature. The living package paleotree continually changes in ways that a standard journal article cannot, as I add new functions, maintain and improve existing functions, and clarify documentation. All of that work, a product of eight years of off-and-on development, can be easily cited by referring to the unchanging 2012 MEE paper.

With regard to my motivation of improving the transparency and reproducibility of dating analyses for fossil trees in our community, I would say that many of the analyses that use simpler approaches now refer to explicit implementations in paleotree or similar packages (such as strap), such that it's much easier now to figure out what most studies are actually implementing. While the cal3 method of dating from my dissertation is still occasionally used, it is much more common to see studies apply full fossilized-birth-death analyses ('FBD'; Heath et al., 2014, PNAS) using MrBayes, BEAST2, or RevBayes. In 2012, I couldn't imagine that standard suites of Bayesian phylogenetic inference software, like the above, would quickly adopt the FBD models that account for incompleteness in the fossil record, or implement the ancestor-move (Gavryushkina et al., 2014, PLOS Computational Biology), allowing the MCMC to consider that some sampled tip taxa on a phylogeny might be direct ancestors of other tip taxa. In Bapst & Hopkins (2016), my co-author and I argued that cal3 had actually become old hat, such that workers interested in cal3 should instead consider FBD analyses from standard inference software. Although this means the process itself cannot run within R, newer functions in paleotree exist for prepping the input files needed for doing FBD tip-dating analyses in MrBayes.

If developing paleotree and submitting to MEE has changed anything in this world at all, then it has changed me. Being the author and maintainer of a mildly-worn R package for the last eight years has been an extremely educational experience. As a paleontologist, I already had all the benefits and penalties of someone sitting at the intersection of earth science and evolutionary biology, but speaking with the broad base of paleotree users made me realize just how different perspectives could be within my own field, with the micropaleontologists, the invertebrate paleontologists, and the vertebrate paleontologists who contacted me for assistance sometimes having inverted senses of what terms common to our field were even supposed to mean, such as 'first appearance' and 'last appearance'. In a sense, it was almost as if each paleontologist, typically trained to a particular taxonomic group, started from a default assumption that everyone else's fossil record was just like the one they studied. This meant stumbling blocks for one group were not the same for other groups, and so the code and the documentation had to evolve to be understandable no matter what starting point a reader began from.

This meant I had to divorce myself from my own preconceptions when writing documentation for paleotree. Rather than assume that the words I used, like 'taxon', 'ancestor', and 'speciation', meant what they had meant to me, I had to instead carefully avoid words that people had deeply held but divergent preconceptions about, and find more general words to express what a particular function did. I had been told as a graduate student that I should always write scientific papers as if the best, most knowledgeable person I could imagine was reading my study. For writing documentation, I flipped this on its head, and instead imagined myself when I was a beginning graduate student, with a head full of confused paradoxes and wrong assumptions, and an inability to translate warning and error messages from R into plain English. I took this perspective deeply to heart, that functions and documentation should be written as if the average user is likely the most confused and (perhaps) misled among us, and I extol it often when I review software articles for MEE and elsewhere. If we aren't writing our software to be used safely and wisely by first-year graduate students and undergrads, who do we think we are writing software for? I would say I have saved myself numerous times already by taking this approach, often stumbling on documentation from paleotree that carefully dissects some issue of practicality that I had entirely forgotten ever writing documentation about. So, perhaps the most confused and misled of all of us is simply ourselves, eight years later.
