It is with great glee that I can announce the latest release of the Portal Project Database. For those of you who just want to go play with the data – here’s the link to the Data Paper we just published in Ecology.
But I would encourage you to read on, as there is more data-related news below.
But first, a story.
As some of you know, I manage a long-term ecological study: the Portal Project. It was started by Jim Brown, Diane Davidson, and Jim Reichman back in 1977 to study competition and plant/animal interactions. That original team moved on (intellectually) and eventually retired. Tom Valone and I inherited the mantle of responsibility for the site. Jim Brown believed in sharing data with whoever asked for it, and in 2009 we formalized that philosophy by publishing all of the data from 1977-2003 that we felt was in good enough shape to document and share. We chose to release the data as an Ecology Data Paper, using Ecological Archives. Partly that was because I had great previous experiences publishing data through Ecology, and partly because I wanted something permanent. I’ve seen many people talk about their “publicly available data” that was either not actually publicly available, stored on a now-defunct personal website, or had so many data-owner-imposed hoops to jump through that it was effectively not public. I wanted the data to be available even if I died (a little grim, I know, but a real consideration when we talk about data archiving).
But we kept collecting data, which meant that in 2013 we realized we had an additional 10 years of data we could share. We had also cleaned up and documented additional data that we wanted to add. So we started the process of publishing the next chunk of data. But how should we do this? Should we just add on to the existing Data Paper? Even assuming Ecological Archives allowed this, it would be awkward, since the title of the original Data Paper included the words 1977-2003. We also decided to add all the graduate students who had been funded to collect the data for the project from 2003-2013, but tracking down people from the 1970s and 80s seemed unfeasible. The short version of the story is that we opted for a separate Data Paper for 2003-2013, but Ecological Archives wanted a new Data Paper with all the years of data in one place – so that’s what we ended up doing. Our new Data Paper contains all the data in the original Data Paper, plus the new years of data, plus old ant and weather data that we felt we now understood well enough to let loose in the world.
It should come as no surprise to those who follow this blog that we here at Weecology are interested in open science. I love Ecological Archives as a permanent repository [1] – the data is safely in the public sphere even if I die, change universities, forget to update my website, or hand the research over to someone who doesn’t share my ideals. But publishing new data papers is a big ordeal that I only want to do every few years. If we want to make data available more rapidly (and we do), we need another mechanism for delivery to the public.
Thus begins the Portal Project GitHub Database experiment.
What is GitHub?
GitHub is a web-based repository typically used for version control and management of software projects. We have created a repository on GitHub (https://github.com/weecology/PortalData) where we can create new releases of data after it has undergone our quality control processes. Here’s a screenshot of what this page looks like:
Version 1.0.0 (which is currently available) matches what is available on Ecological Archives and can be reached through this link: https://github.com/weecology/PortalData/releases or by clicking the release button on the main page of the repository (see above).
When will new data be released?
Our aim is to release a new vetted and updated version approximately every 6 months. However, you can also get our most up-to-date data from GitHub. You can find it on the main page (see figure above). As part of this process, we have moved our data entry and quality control processes to center around the Portal Data repository. Yes, that’s right, you’ll be able to access our new data as soon as we’ve entered it from our field datasheets. New data has not gone through the same level of quality control – so user beware. That data will be less stable than the release data.
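For anyone who would rather script the download than click through the site, here is a minimal Python sketch of the "release versus latest" choice described above. It assumes GitHub's raw-file URL pattern (raw.githubusercontent.com/owner/repo/ref/path). The file path, the tag name, and the branch name are hypothetical placeholders, so check the repository for the real ones before using this.

```python
import csv
import io
import urllib.request

REPO = "weecology/PortalData"  # the repository named in this post


def raw_url(path, ref):
    """Build a raw-file URL for `path` at a given tag or branch name."""
    return f"https://raw.githubusercontent.com/{REPO}/{ref}/{path}"


def fetch_rows(path, ref):
    """Download one CSV file at `ref` and return its rows as dicts."""
    with urllib.request.urlopen(raw_url(path, ref)) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical values: the real folder, tag, and branch names may differ.
RODENT_FILE = "Rodents/Portal_rodent.csv"
stable_url = raw_url(RODENT_FILE, "1.0.0")   # a vetted release
latest_url = raw_url(RODENT_FILE, "master")  # newest data, still being vetted
```

Calling `fetch_rows(RODENT_FILE, "1.0.0")` would then give you the release version of the file, and swapping in the branch name would give you the most recent (less stable) version. Recording which one you used makes it much easier to reproduce an analysis later.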
GitHub met a variety of our data publishing and data management needs. I won’t go into everything here, but the big one is version control. Every time we make a change to the data files, it is documented. This has not been the case in the past. Though we did try to keep records, our record-keeping relied on someone making a change in the database and then remembering to write it down somewhere. Now with our new setup, any changes will be automatically documented by commit messages (descriptions of changes that accompany any modification to a file on GitHub). The history is also publicly available, so users can consult it as well, maybe to track down why results differ between two different downloads. How can you do this? Select one of the folders in the current repo – let’s randomly pick the rodent folder and look at the history of the rodent data file (Portal_rodent.csv).
This gives you all the commit messages that are associated with changes to this file. Maybe one of these catches your eye. You can see exactly what got changed by clicking on it.
The red shows you a row that has been deleted. The green shows you a row that is “new”.
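If you have two downloads of the same file and want the same red/green view on your own machine, a rough Python sketch using the standard library's `difflib` is below. This is not how GitHub computes its diffs, just an illustration of the idea. The rows are made-up toy values, not Portal data, and `changed_rows` is a hypothetical helper name.

```python
import difflib


def changed_rows(old_text, new_text):
    """Return (removed, added) rows between two versions of a text file."""
    removed, added = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("- "):    # row only in the old version ("red")
            removed.append(line[2:])
        elif line.startswith("+ "):  # row only in the new version ("green")
            added.append(line[2:])
    return removed, added


# Toy example: one weight value was corrected between two versions.
old = "id,species,weight\n1,AA,40\n2,BB,12\n"
new = "id,species,weight\n1,AA,40\n2,BB,21\n"
removed, added = changed_rows(old, new)
print("removed:", removed)
print("added:", added)
```

Note that a corrected value shows up as one removed row plus one added row, which matches what the GitHub view shows for an edited line.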
How do we feel about this shift to GitHub?
We were very nervous about this initially. While the White Lab has some serious Git-Fu skills, the Ernest Lab views itself as field ecologists and GitHub is not exactly intuitive to us. We worried we would screw up the data. We worried we were adding complexity to an already complex quality control process. But so far we are really happy about our new system. Integrating data entry into the data publishing process ensures that we are always providing updated data, even if we’re slow on official releases. Version control is allowing us to document all the changes being made to the database – and everyone involved with the project has a chance to see the changes and comment on them if they have concerns. And everyone in our group (and now the world) has access to the most up-to-date data (and can choose between data that is extremely current but still being vetted for errors and data that is less current but more stable and less error prone). We’re not alone in taking this step to using GitHub for data management; other examples of projects that have moved to GitHub include the Biomass and Allometry Database for woody plants (BAAD) and the Open Tree of Life.
I want to end by saying that I don’t currently intend to stop submitting major updates to Ecological Archives or some other permanent repository. What GitHub provides is more transparency on how the data is being managed (both for people within and outside our group) and faster data streaming to other scientists than we’re capable of doing through Ecological Archives. But what it doesn’t do is provide the data in a stable way for ecologists in the future – and that is something we take very seriously! So if you only want to use our data via Data Papers, never fear, you now have all the data through 2013 and more will come eventually. But in the meantime, you might want to check out our data repository.
[1] I might love it a little less right now since my data files are ‘Wiley Property’ housed on Wiley servers, but that’s a separate blog post.
What a great system! This: “publishing new data papers is a big ordeal that I only want to do every few years” really resonates, as I’m affiliated with two projects with massive datasets. We are actually using GitHub issues already for one of them to track “problems that need to be solved” and “things that need to be done”. Unfortunately, our data (for both projects) is measured in the TBs, which GitHub can’t handle. Do you have any thoughts on what to do if you push up against GitHub’s 100 MB file limit? Or are your files small enough that you won’t run into this problem anytime soon?
Our files are small enough that it’s not an issue for us. But I feel like the file limit is something Ethan’s group has run into for some of their projects. I’ll see if I can get him to swing by today to weigh in.