Over the last year and a half we have been actively developing a semester-long Data Carpentry course designed to be easily customized and integrated into existing graduate and undergraduate curricula.
Data Carpentry for Biologists contains course materials for teaching scientists how to work more effectively with data. The course provides introductions to data management and relational databases, data manipulation and analysis, and data visualization. It covers the same general types of material as a two-day Data Carpentry workshop, but expands the materials and opportunities for practice into a full-length university course. The teaching material uses R and SQLite, with some corresponding materials for Python as well. To help students understand the direct applications to their interests, the examples and exercises focus on biological questions and working with real data. The course emphasizes using best practices to produce reusable and reproducible data analysis.
Active-learning Teaching Materials
Learning computing requires active practice by working through programming problems. Just diving in to computing is challenging for most scientists, so the course instruction is designed to pair short live-coding introductions to concepts with related exercises that students work on immediately afterward. Additional exercises are assigned later for practice. This follows the “I do”, “We do”, “You do” approach to teaching, which leverages the benefits of active learning and flipped classrooms without leaving students who are less comfortable with the material feeling lost. The bulk of class time is spent working on assigned exercises, with the instructor moving around the room helping guide students through things they don’t understand and engaging with students who are thinking about advanced applications of what they’ve learned.
This approach is the result of lots of reading about effective teaching methods and Ethan’s experience teaching this and related courses over the last six years at Utah State University and the University of Florida. It seems to work well for both students that get the material easily and those that find it more challenging. We’ve also tried to make these materials as useful as possible for self-guided students.
Open course development
Software Carpentry and Data Carpentry have shown how powerful collaborative lesson development can be and we’re interested in bringing that to the university classroom. We have designed the course materials to be modular and easy to modify, and the course website easy to clone and set up. All of the teaching materials and associated website files are openly available at the Data Carpentry for Biologists repository on GitHub under CC-BY and MIT licenses. The course materials are all written in Markdown and everything runs on Jekyll through GitHub Pages. Making your own version of the course should take less than an hour. We’ve developed documentation for how to create your own version of the course and how to contribute to development. Exercises and assignments are modular and changing exercises and assignments simply involves reordering items in a list. Adding a new exercise involves creating a new Markdown file and then adding its title to the list of exercises for an assignment.
If you teach, or want to teach, a course like this, we’d love to get you involved. Here are some useful links for getting started.
We want to be sure getting involved is as easy as possible. We’ve worked hard to provide documentation and help resources for students and instructors. Students can find all they need to know at our student start guide. Instructors have access to course content and site design documentation.
Development of this course was generously supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.
PH.D STUDENT OPENING IN COMMUNITY ECOLOGY IN THE ERNEST LAB
The Ernest Lab at the University of Florida has an opening for a Ph.D. student in the area of Community Ecology to start fall 2017. The student will be supported as a graduate research assistant on an NSF-funded project to study regime shifts (rapid shifts in ecosystem structure and function) at the Portal Project, a long-term research site in southeastern Arizona. A short version of the grant proposal can be found here. The student will participate in data collection efforts in Arizona on rodents and plants.
The Ernest lab is interested in general questions about the processes that structure communities, with a particular focus on understanding when and how ecological communities change through time. Students are free to develop their own research projects depending on their interests. Examples of questions in community ecology that students have pursued as part of their dissertation include: Does strong frequency dependence help buffer rare species from stochastic extinctions?, Are biodiversity patterns sensitive to changes in biotic interactions?, and Do disturbances impact species populations and community-level properties similarly?
The Ernest Lab is part of the Weecology research group. Weecology is a partnership between the Ernest Lab, which tends to be more field and community ecology oriented, and the White Lab, which tends to be more quantitatively and computationally oriented. The Weecology group supports and encourages students interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, at teaching-focused colleges, and as postdocs in major research groups. We are also committed to supporting and training a diverse scientific workforce. Current and former group members come from a variety of racial and ethnic backgrounds from the U.S. and other countries, and include members of the LGBT community, military veterans, and students who are the first generation in their family to go to college.
Interested students should contact Dr. Morgan Ernest (email@example.com) by Oct 24th, 2016 with their CV, transcripts (unofficial are perfectly ok), and a brief statement of research interests.
From how we do science to publishing practices to the sociology of science, there isn’t an aspect of the scientific endeavor that isn’t in flux right now. Long-term readers of Jabberwocky know that understanding how the scientific endeavor is changing and figuring out how to maximize the good for science and minimize the bad is a bit of an obsession for us. Ethan has been a tireless proponent (or “a member of the radical fringe” as one former ESA president called him) for changes in scientific publishing and reproducibility. For me, the issue closest to my heart is data availability, and I see it as a “for the good of science” issue. By definition, science relies on data. If data is stuck in a drawer and no one knows about it or it becomes technologically inaccessible (I’m looking at you 5 1/4” floppy disk) then it effectively does not exist for the scientific endeavor. At best it is knowledge that needs time and resources (neither of which we have in abundance) to reacquire. At worst that is knowledge lost forever.
But publishing one’s data for the world to use is not the ecology way – in part because extracting data from nature is hard. Much of the resistance is because scientists are afraid they will lose out on credit for that hard work. If they are the one to publish on that data, they get the credit. If someone else does, the credit is less. Regularly, I see a journal article, or a tweet, or a blog post worried about the increasing push to make data publicly available. Most of these just make me sad for science, but one in particular has haunted me for a while because it focused on something near and dear to my heart: Long-Term Data. A paper published by Mills et al. in TREE last year argued that sharing long-term data will kill off long-term studies. This paper conducted a survey of 73 Principal Investigators running long-term ecology projects. Almost all said they were in favor of sharing data “with the agreement or involvement of the PI”. Only 8% were in favor of “open-access data archiving”1. 91% supported data-sharing when there were clear rules for how/when the data would be shared. Suggestions for rules included “(i) coauthorship or at least acknowledgment, depending on the level of PI involvement; (ii) no overlap with current projects, particularly projects conducted by students or postdoctoral fellows; and (iii) an agreement that the data go no further than the person to whom they are entrusted.”
My colleagues were so against open-access data archiving that many said they would rather submit their science to a less high profile journal if publishing in a more high profile journal required them to archive their data2. The paper argues that this type of decision making will result in less impactful papers and harm the careers and funding opportunities for scientists studying long-term ecology. Fears were expressed that flawed science would be produced without the input of the PIs, that being scooped would damage the careers of their trainees, that time would be wasted as multiple labs conducted the same analyses, that lower incentives for conducting this type of science would reduce the number of long-term studies, that there would be less collaboration, and that opportunities to obtain new funding would be lost because the research is being done by other groups. You get the idea. Sharing data results in lost opportunities to author papers with cascading consequences.
Having just published the next installment of the Portal Project long-term data and begun our on-line data streaming experiment, it seems like an ideal time to talk about my experiences and concerns with sharing long-term data. After all, unlike many people who have expressed these fears, my raw data3 has been openly available since 2009. How calamitous has this experience been for me and my students?
Since the database was published in 2009, it has been cited 16 times. That’s about 2ish papers a year, not exactly an impressive flurry of activity – though you could argue that it would still be a significant boost to my productivity. But the picture changes when you look at how exactly the data are being used. Of those 16 citations, 4 cite the Data Paper to support statements about long-term data being published, 4 use the data as one of many datapoints as part of a meta-analysis/macroecological study, 3 use the data to plot a data-based example of their idea/concept/tool (i.e. the data is not used as part of an analysis that is being interpreted), 3 use the data to ask a scientific question focused on the field site, 1 cites the data paper for a statement in the metadata about the importance of experimental manipulations and 1 cites it for reasons I cannot ascertain because I couldn’t access the paper but the abstract makes it clear they are not analyzing our data. No one is going to add me as a co-author to cite the existence of long-term data, cite statements made in the metadata, or make an example figure, so we’re down to 7 papers that I “lost” by publishing the data. But it gets even worse4. I am already a co-author on the 3 papers that focus on the site. So, now we’re down to 4 meta-analysis/macroecological studies. As someone who conducts that type of research I can tell you that I would only need to include someone as an author to get access to their data if I desperately needed their particular data for some reason (e.g. location, taxa, etc.) or if I couldn’t get enough sites to make a robust conclusion otherwise. There is a lot of data available in the world through a variety of sources (government, literature, etc.). Given the number of studies used in those 4 papers, if I had demanded authorship for use of my data, my data would probably not have been included.
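For readers who want to check the arithmetic, the tally above can be written out as a short Python sketch. The counts are the ones reported in the paragraph above; the category labels are just shorthand for the descriptions in the text, not an official classification.

```python
# Citation counts by type of use, as reported in the text above.
# The dictionary keys are shorthand labels chosen for this sketch.
citations = {
    "cites existence of long-term data": 4,
    "meta-analysis / macroecology": 4,
    "example figure only": 3,
    "site-focused question": 3,
    "cites metadata statement": 1,
    "unknown (abstract shows no analysis of the data)": 1,
}

total = sum(citations.values())

# Uses where a data owner could plausibly have asked for co-authorship.
potentially_lost = (
    citations["meta-analysis / macroecology"]
    + citations["site-focused question"]
)

# The author is already a co-author on the site-focused papers,
# so those were never "lost" in the first place.
remaining = potentially_lost - citations["site-focused question"]

print(total, potentially_lost, remaining)
```

Only the last number matters for the argument: it is the count of papers where withholding the data might, at best, have led to an authorship request.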
Final tally: We published one of the few (and among the longest-term) long-term datasets on climate, plants, and consumers for a single site in existence. This dataset even includes a long-term experimental manipulation (the ‘gold standard’ of ecology). That data has been openly available with no limitations for 7 years. I cannot yet point to an instance where someone used the data in a way that has cost us a paper – either through scooping us or because if the data had not been available they would have been forced to collaborate with me.
In fairness, I don’t expect that to be true forever, but right now our challenge isn’t how to avoid being scooped, it’s how do we get anyone to use our data!!! When I talk about this data online or give talks, invariably people tell me “you are so lucky to have a data set like that”. My response: The entire world has a data set exactly like this because we published it! Not one of those people has published anything with it.
My experience is not unique. A response paper was published this year by Simon Evans examining this lost opportunity cost of making long-term data publicly available. Evans combed through Dryad (the data archival site for many ecology and evolution journals) to see how many times long-term data archived on the site had been used. Defining long-term data as 5 years or more, there were 67 publicly accessible datasets on Dryad.5 How often had these data been used? Using citations to the data package, examining citations to the original paper associated with the data package, and contacting the data authors to see if they knew of instances of their data being used, Evans found that there were 0 examples of deposited data being reused by investigators who were not authors on the original study.6
Most people I know who are looking for data often forget about Dryad, so maybe Dryad just hasn’t been ‘discovered’. I would be interested to know how Evans’ result compares to data being downloaded from Ecological Archives. But given our experience with the open Portal Project data on Ecological Archives, I suspect differences in long-term data usage between different repositories are small.
So, currently there is no evidence that publishing long-term data results in the negative impacts predicted in the Mills et al. paper. Does it have a positive impact? For long-term data, I’m currently unclear because so few people seem to be using that type of data. But I have published some macroecological data sets in the past (Ernest 2003, Smith et al. 2003, Myhrvold et al. 2015) and there have definitely been positives for me. No, publishing them has not resulted in papers for me, but I have also not been scooped on anything I was actively working on or even seriously interested in pursuing. But those datasets have resulted in a fair number of citations (397 to date via Google Scholar), they contribute to my h-index (which universities use to judge me), and they have definitely contributed to my name recognition in that area of ecology. (I have been tongue tied on more than one occasion when a big name walked up to me and thanked me for publishing my 2003 mammal life history dataset). No, these aren’t publications, but name recognition is ever harder to obtain in the increasingly crowded field of science, and citation and impact metrics (for better or worse) are increasingly a part of the assessment of scientists. So yes, I believe that publishing datasets in general has been a net positive for me.
Finally, I can also say that NSF is watching. At the end of my previous NSF grant supporting the research at the field site, my program officer was in my ear reminding me that as part of that grant I needed to publish my data. I don’t know how other people feel about this, but I feel that a happy NSF is a positive.
So, in my experience, publishing my long-term data has not resulted in the grand implosion of my research group. If anything, I think the relative dearth of activity using the long-term data – especially in comparison to the macroecological datasets – suggests that very few people are actually using long-term data. To me, this lack of engagement is much more dangerous for the continuation of funding for long-term ecology than the nebulous fears of open data. If people don’t actively see how important this type of data is, why would they ever recommend that it be funded? Why prioritize funding long-term data collection – a data type most ecologists have never used and don’t understand – over an experiment, which most ecologists do and understand? We need more advocates for long-term ecology and I don’t believe you can create them by tightly controlling access so only a lucky few can work with the data. So if you’re wondering why we now stream the data nearly live on GitHub, or why we make the data available to be used as a teaching database for DataCarpentry, that’s why. Long-term datasets – and a large body of scientists who understand how to work with them – are going to be important in tackling questions about how and why ecosystems change through time (not just in the past but into the future). This makes increasing the number of people working with long-term data a win for science – and in the long-run I believe it will be a win for me and others who invest so much blood, sweat and tears into generating long-term data.
1 I have never had quantitative evidence before that I was not “normal”. My reaction was to give myself a pat on the back, but I couldn’t figure out if the pat was consolatory or congratulatory.
2 This language is drawn from the original paper and does not reflect my opinions on “high” vs “low” impact journals.
3 In the spirit of complete disclosure, this data is not exactly mine. I’ve collected a lot of it since 1995, but the site was started by Jim Brown, Diane Davidson, and Jim Reichman in 1977 and many people have been involved in collecting the site’s data. But I argue those expressed fears still apply to my case because when Tom Valone and I became the PIs of the project in the early 2000s we could have opted to sit on this treasure trove of data instead of publishing it.
4 Worse depends on your point of view, of course. If you don’t want people to use your data, this is good news. I use worse here in the sense that this is bad news for the argument that publishing your data will cause you to lose papers. It is also worse from the perspective that we published this data and no one is using it.
5 72 data packages in total were identified but some of these were under embargo and not yet available.
6 This makes Portal look like a rock star! Our raw data (i.e. information had to be extracted using the raw data and could not be obtained from summary statistics in one of our papers) were used in 4 meta-analysis/macroecology papers. That is literally infinitely more used than those other long-term data :)
The first, throat clearing post to kick off what (we hope) will be a revitalization of the Portal Project Blog
Updates on temporal community dynamics, and a whole new project scheme.
Things have been quiet on the portal blog lately.
But in the lab and the field, it has been anything but.
Over the past year there have been big changes afoot for the Portal project. In the summer of 2015, Weecology lab headquarters relocated from Utah State University in small, mountainous Logan to the massive University of Florida campus in subtropical Gainesville. So now we study Arizona’s desert rodents from the mossy groves of the southeast rather than the alpine forests of the Rockies, like true cosmopolitan, ever-curious ecologists.
The Portal project headquarters relocated from Utah to Florida in summer 2015. Leaping Krat photo illustration by Molly Zisk, taken from http://www.ocregister.com
If you’re going to box up your life, you might as well reorganize it too. In the midst of planning her transcontinental move, Dr. Morgan Ernest…
[Update: A little bird pointed out I didn’t have a link to the actual Portal blog. That has been remedied along with a link to the Portal Project website for those who’d like more info on the project]
A couple weeks ago, I posted about the new data paper from my long-term field site, the Portal Project. Most of you probably have no idea that there is also a blog associated with the field site that shares stories from the field and observations of interesting things going on down there. You’re forgiven for not knowing it existed because we have not been really good about posting on it for the past year or so. But we’re trying to change that! One of the things we really liked about the blog was that a mix of people used to follow it – some scientists, but also people who lived locally around Portal, AZ or had helped out down there at some time over the past 30 years and were just curious to know what was going on. My student Joan Meiners, who has a strong interest in science communication, is helping out on the blog with some posts to help kickstart things. We’ll be reblogging some of the posts here as well. If you think they look interesting, click on the reblog to take you to the post on the Portal Blog. If not, just ignore! We’ll also highlight Portal Blog reblogs using the [PortalBlog] tag in the title.
This is the story behind “Comparing process-based and constraint-based approaches for modeling macroecological patterns” by my former PhD student Xiao Xiao, James O’Dwyer, and myself.
I was on sabbatical in the fall of 2013 and was doing a lot of reading, and I reread “An integrative framework for stochastic, size-structured community assembly” by James O’Dwyer, Jessica Green, and colleagues. A couple of months earlier Xiao Xiao, Dan McGlinn, & I had submitted a paper on “A strong test of the Maximum Entropy Theory of Ecology”, where we had tested John Harte and colleagues’ new maximum entropy based model by looking at four different predictions of the model simultaneously. In rereading O’Dwyer et al. I realized that their size-structured neutral theory would probably be able to predict a similar set of ecological distributions to those predicted by the maximum entropy model. We’d already conducted the first three levels in McGill’s hierarchy of model testing (see McGill 2003 and McGill et al. 2006) for Harte et al.’s maximum entropy model (checking the general form of the predictions, comparing to null hypotheses, and testing multiple complex predictions) and this would let us complete the last level by comparing the fit to realistic alternative models.
Getting to work
The math in O’Dwyer et al. is pretty advanced and I knew James through shared interests in ecological theory, so I emailed him and Xiao to see if it might be mathematically tenable to use James’ model to make the same predictions we’d been testing and, if so, if he and Xiao were interested in working together on trying to do this.
What resulted was a very interdisciplinary collaboration, combining shared expertise in mathematical modeling, computing, analysis of large ecological datasets, and knowledge of the foundations of multiple models/theories. It was common for two of the three of us to have a detailed conversation that the third collaborator couldn’t follow in detail, but the third person always felt comfortable interjecting to make sure that the big picture goals of the project stayed on track. In particular, I remember a ~100 message long email exchange where James and Xiao were working on getting the two theories to make identical predictions. They were on-boarding each other with the details of the two theories and then exchanging ideas in math that I wasn’t even trying to keep up with. I’d occasionally jump in to provide some relevant empirical details and information on other related theory/ideas to help keep things moving in the right direction, but generally just got to watch in awe as two folks with amazing theory skills did their thing. Xiao was constantly running and sharing new analyses which really helped make all of our interactions cohesive by grounding them in graphs and real values.
Reviews, revisions, and the speed of scientific dialog
During the review process John Harte pointed out that there was a second generation model from the maximum entropy theory that was expected to improve the areas where the version we were analyzing was performing poorly. We’d known about this work for a couple of years since we’d been actively sharing ideas and results with the Harte Lab throughout this research. We knew that this paper was already in review, but it didn’t seem like we could reasonably analyze work that they hadn’t made publicly available yet. So, we’d acknowledged in our paper that new models based on this general theory could improve its performance and planned to potentially come back later and analyze the new model in a second paper.
Aside: this is a perfect example of the advantages of preprints for facilitating a rapid scientific dialog. If this second generation paper had been posted as a preprint at the time it was initially submitted for review we would have been able to cite and analyze the new theory from earlier on in the process of working on our paper. In fact, we probably wouldn’t have had any choice because good reviewers would have pointed us to the preprint and told us that we needed to address it.
Without a preprint and with the paper still in review we could have easily told the editor that we couldn’t address the new model yet, and in fact the editor explicitly gave us that option. This would have made for a quick and easy acceptance since all the other comments involved only writing, but it arguably wasn’t in the best interests of moving science forward quickly. The new model would either be published first, or shortly after our paper, which would mean that the answer to the overarching question would have been very much up in the air. So, it would be better to add the new model to our analyses, but it would take a lot more work to do so. We would have to implement a new model from scratch, integrate it into our code base, and then rerun all of our fairly time-consuming analyses. Xiao was a newly minted PhD and James was an untenured assistant professor, so the best career strategy for them would have been to just get the paper in as is. This was particularly true for Xiao who was going to have to do the majority of the work getting the new model implemented, so we left the decision in her hands and made it clear that everyone was happy with either choice. She decided that the extra work was worth it to better answer the core question now and not only added the 2nd generation maximum entropy model, but also a more advanced version of the size-structured neutral theory model that she and James had been working on. This also broadened the scope of inference for the paper because we had now evaluated two models from each theory instead of just a single model.
It is with great glee that I can announce the latest release of the Portal Project Database. For those of you who just want to go play with the data – here’s the link to the Data Paper we just published in Ecology.
But I would encourage you to read on, as there is more data-related news below.
But first, a story.
As some of you know, I manage a long-term ecological study: the Portal Project. It was started by Jim Brown, Diane Davidson, and Jim Reichman back in 1977 to study competition and plant/animal interactions. That original team moved on (intellectually) and eventually retired. Tom Valone and I inherited the mantle of responsibility for the site. Jim Brown believed in sharing data with whomever asked for it, and in 2009 we formalized that philosophy by publishing all of the data from 1977-2003 that we felt was in good enough shape to document and share. We chose to release the data as an Ecology Data Paper, using Ecological Archives. Partly that was because I had great previous experiences publishing data through Ecology, and partly because I wanted something permanent. I’ve seen many people talk about their “publicly available data” that was either not actually publicly available, stored on a now-defunct personal website, or had so many hoops imposed by the data owner that it was effectively not public. I wanted the data to be available even if I died (a little grim, I know, but a real consideration when we talk about data archiving).
But we kept collecting data, which meant in 2013 we realized we had an additional 10 years of data we could share. We also had cleaned up and documented additional data that we wanted to add. So we started the process of publishing the next chunk of data. But how should we do this? Should we just add on to the existing Data Paper (assuming Ecological Archives allowed this it would be awkward since the title of the original data paper included the words 1977-2003)? We also decided to add all the graduate students who had been funded to collect the data for the project from 2003-2013, but tracking down people from the 1970s and 80s seemed unfeasible. The short version of the story is that we opted for a separate data paper for 2003-2013, but Ecological Archives wanted a new Data Paper with all the years of data in one place – so that’s what we ended up doing. Our new Data Paper contains all the data in the original Data Paper, plus the new years of data, plus old ant and weather data that we felt we now understood well enough to let loose in the world.
It should come as no surprise to those who follow this blog that we here at Weecology are interested in open science. I love Ecological Archives as a permanent repository1 – the data is safely in the public sphere even if I die, change universities, forget to update my website, or hand the research over to someone who doesn’t share my ideals. But publishing new data papers is a big ordeal that I only want to do every few years. If we want to make data available more rapidly (and we do), we need another mechanism for delivery to the public.
Thus begins the Portal Project GitHub Database experiment.
What is GitHub?
GitHub is a web-based repository typically used for version control and management of software projects. We have created a repository on GitHub (https://github.com/weecology/PortalData) where we can create new releases of data after it has undergone our quality control processes. Here’s a screenshot of what this page looks like:
Version 1.0.0 (which is currently available) matches what is available on Ecological Archives and can be reached through this link: https://github.com/weecology/PortalData/releases or by clicking the release button on the main page of the repository (see above).
When will new data be released?
Our aim is to release a new vetted and updated version approximately every 6 months. However, you can also get our most up-to-date data from GitHub. You can find it on the main page (see figure above). As part of this process, we have moved our data entry and quality control processes to center around the Portal Data repository. Yes, that’s right, you’ll be able to access our new data as soon as we’ve entered it from our field datasheets. New data has not gone through the same level of quality control – so user beware. That data will be less stable than the release data.
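The choice between a vetted release and the newest data can be made explicit in an analysis script. The sketch below builds raw-file URLs for the repository named above; the folder and file path, the tag name "1.0.0", and the branch name "main" are assumptions made for illustration, so check the repository itself for the actual names before relying on them.

```python
# Repository named in the post.
REPO = "weecology/PortalData"

# Assumed location of the rodent data file within the repository.
RODENT_FILE = "Rodents/Portal_rodent.csv"


def raw_data_url(path, ref):
    """Build a raw-file URL for `path` at a given tag or branch name."""
    return f"https://raw.githubusercontent.com/{REPO}/{ref}/{path}"


# Pinning to a release tag gives stable, vetted data (tag name assumed).
stable_url = raw_data_url(RODENT_FILE, ref="1.0.0")

# Pointing at the default branch gives the newest, less vetted data
# (branch name assumed).
latest_url = raw_data_url(RODENT_FILE, ref="main")

print(stable_url)
print(latest_url)
```

Recording which of the two was used makes it much easier to explain later why results differ between two downloads.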
GitHub met a variety of our data publishing and data management needs. I won’t go into everything here, but the big one is version control. Every time we make a change to the data files, it is documented. This has not been the case in the past. Though we did try to keep records, it relied on someone making a change in the database and then remembering to write it down somewhere. Now with our new setup, any changes will be automatically documented by commit messages (descriptions of changes that accompany any modification to a file on GitHub). The history is also publicly available, so users can consult our record of changes as well, maybe to track down why results differ between two different downloads. How can you do this? Select one of the folders in the current repo – let’s randomly pick the rodent folder – and look at the history of the rodent data file (Portal_rodent.csv).
This gives you all the commit messages that are associated with changes to this file. Maybe one of these catches your eye. You can see exactly what got changed by clicking on it.
The red shows you a row that has a deletion. The green a row that is “new”.
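If you would rather make the same kind of comparison on your own computer, for example between two copies of a file you downloaded at different times, here is a minimal sketch in Python. The two small "files" are made up for illustration (the column names and values are hypothetical, not real Portal records), and the `changed_rows` helper is my own invention rather than anything GitHub provides. It uses the standard library's `difflib.ndiff` to list removed and added rows, which roughly corresponds to the red and green rows in GitHub's view.

```python
import difflib

# Two made-up versions of a small CSV file (hypothetical columns and values,
# not real Portal records).
old_version = """record_id,species,weight
1,DM,40
2,PP,17
3,DO,52
"""

new_version = """record_id,species,weight
1,DM,40
2,PP,19
3,DO,52
"""


def changed_rows(old_text, new_text):
    """Return (removed, added) rows between two versions of a text file."""
    removed, added = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])  # only in the old version (red on GitHub)
        elif line.startswith("+ "):
            added.append(line[2:])    # only in the new version (green on GitHub)
        # unchanged rows ("  ") and hint lines ("? ") are ignored
    return removed, added


removed, added = changed_rows(old_version, new_version)
print("removed:", removed)
print("added:", added)
```

Note that an edited row shows up as one removed row plus one added row, just as it does in the commit view.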
How do we feel about this shift to GitHub?
We were very nervous about this initially. While the White Lab has some serious Git-Fu skills, the Ernest Lab views itself as field ecologists and GitHub is not exactly intuitive to us. We worried we would screw up the data. We worried we were adding complexity to an already complex quality control process. But so far we are really happy about our new system. By integrating data entry into the data publishing process, it ensures that we are always providing updated data, even if we’re slow on official releases. Version control is allowing us to document all the changes being made to the database – and everyone involved with the project has a chance to see the changes and comment on them if they have concerns. And everyone in our group (and now the world) has access to the most up-to-date data (and can choose between extremely current but still being vetted for errors or less current but more stable and less error prone). We’re not alone in taking this step to using GitHub for data management; other examples of projects that have moved to GitHub include the Biomass and Allometry Database for woody plants (BAAD) and the Open Tree of Life.
I want to end by saying that I don’t currently intend to stop submitting major updates to Ecological Archives or some other permanent repository. What GitHub provides is more transparency on how the data is being managed (both for people within and outside our group) and faster data streaming to other scientists than we’re capable of doing through Ecological Archives. But what it doesn’t do is provide the data in a stable way for ecologists in the future – and that is something we take very seriously! So if you only want to use our data via Data Papers, never fear, you now have all the data through 2013 and more will come eventually. But in the meantime, you might want to check out our data repository.
1 I might love it a little less right now since my data files are ‘Wiley Property’ housed on Wiley servers, but that’s a separate blog post.
We are very excited to announce the newest release of the EcoData Retriever, our software for automating the downloading, cleaning, and installing of ecological and environmental data. Instead of hours or days trying to get complicated datasets like the Breeding Bird Survey ready for analysis, the Retriever lets you simply click a button or run a single command from R or the command line, and your computer does the rest.
It’s been over a year since the last retriever release and there are lots of new features and improvements to be excited about.
- We’ve added 21 new datasets, including major ecological and environmental datasets like eBird, VertNet, the Global Wood Density Database, and the PRISM climate data.
- To support all of these datasets we’ve added support for additional data types, including larger-than-memory archive files, and we’ve also improved the ability to control where downloaded files are stored and how they are clustered together.
- We’ve significantly improved documentation and now have a new automatically built documentation site at Read The Docs.
- We’ve also made a lot of under the hood improvements.
This is also the first release that has been overseen by Weecology’s new software engineer, Henry Senyondo. We’re excited to have Henry on the team, and now that he’s around, development of both the EcoData Retriever and other lab software projects will be happening more quickly.
A big thanks to the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative for funding this development through Grant GBMF4563 and to the National Science Foundation for funding as part of a CAREER award to Ethan White.
# From R, a single command fetches the Breeding Bird Survey (BBS) data:
data <- ecoretriever::fetch("BBS")
In a big step forward for allowing proper credit to be provided to all of the awesome folks collecting and publishing data, the journal Global Ecology & Biogeography has just announced that they will start supporting an unlimited set of references to datasets used in a paper.
A growing concern in the macroecological community has been that many papers whose data are used in meta-analyses or data-compilation papers have not been getting citation credit because most journals require these papers to only be listed in the supplemental material (which is not indexed by most indexing services). GEB is proud to support the inclusion of a second list of references within the main paper for all data papers used… To our knowledge, GEB is the first journal in the ecological field to do this. And we’ll be working with Wiley to further improve options in this area.
These references will be included immediately following the traditional references section in both the html and pdf versions of the paper. You can see an example in Olds et al. (2016).
What this means is that when you combine data from dozens or hundreds of studies to conduct a synthetic analysis, you can cite all of the sources in a way that will provide citation credit to those collecting the data¹. It also means that scientists using large data compilations can cite the original data sources as well as the compilation itself².
This is important for encouraging the publication of data, since one of the common reasons that scientists don’t publish data is a lack of credit, and citation only in non-indexed supplementary materials sections is a common concern.
Facilitating proper citation of all data sources is something the community has been requesting and it’s great to see GEB taking the lead in this area. Since Wiley, the publisher of GEB, is the largest publisher of ecology journals, it should be straightforward to implement this new approach widely. If other journals follow GEB’s lead, we will enter a new era where citation of data can be as complete as possible, allowing proper credit to everyone who collects and publishes data.
1 GEB will need to make sure that this section gets properly picked up by the indexers, and tweak the presentation as necessary if it isn’t.
2 Provided that the compilation provides a method for compiling a citation list of all associated sources.
For the past few years I’ve been involved in a collaboration to put together a broad-coverage life history database for mammals, reptiles, and birds. The project started because my collaborator, Nathan Myhrvold, and I both had projects we were interested in that involved comparing life history traits of reptiles, mammals, and birds, and only mammals had easily accessible life history databases with broad taxonomic coverage. So, we decided to work together to fix this. To save others the hassle of redoing what we were doing, we decided to make the dataset available to the scientific community. While this post started out as a standard “Hey, check out this new publication from our group” post (Here it is, by the way: Myhrvold, N.P., †E. Baldridge, B. Chan, D. Sivam, D.L. Freeman, S.K.M. Ernest. 2015. An Amniote Life-history Database to Perform Comparative Analyses with Birds, Mammals, and Reptiles. Ecology 96:3109), I’ve realized that there’s something more important that needs to be discussed: what is the future of trait databases?
Trait databases are all the rage these days, for good reason. Traits are interesting from evolutionary and ecological perspectives: How and why do species differ in traits, how do traits evolve, how quickly do traits change in response to changing environment, and what impacts do these differences have on community assembly and ecosystem function? They have the potential to link individual performance with local, regional, and even global processes. There’s lots of trait data out there, but most of it has been buried in papers, books, theses, gray literature, field guides, etc. This has led to the explosion of compendiums compiling trait data. Some of these are published as Data Papers (e.g., Mammals: Jones et al 2009, Plankton: Kremer et al 2014) or on-line databases (e.g., AnAge, FishBase), which are open for everyone to use. Many of these open datasets are generated by a small number of scientists to address some particular question. Some are quasi-open/quasi-private resources generated by consortiums of scientists (TRY).
There are a variety of issues regarding these trait compendiums. Not least of these is credit: the compendiums pull data from numerous sources, but how do data generators get credit, and what type of credit is reasonable? This is a doozy that I don’t have an answer to. Instead, my focus today is on the eventual endgame of trait databases. No trait database currently being produced has all the trait data of interest for every species. This means we have a bunch of incomplete data products running around. So, every few years, a bigger – more complete, but still incomplete – trait dataset is produced for some group of species. Sometimes the bigger dataset replicates the effort of the smaller one, sometimes it incorporates the smaller compilation whole-cloth, sometimes they have little overlap in sources whatsoever. Data compilations vary in their ease of use and accessibility. Some databases are widely known, some are known only to a few insiders. I could keep going. Clearly this state of affairs is less than optimal for rapid progress in studying traits.
So what’s the end game here? What should we be doing? In my opinion, what we need is a centralized trait database where people can contribute trait data and where that data is easily accessible by anyone who wants to use it for research (not just to the contributing members of the database). It would also be nice if people who contribute significant amounts of data (no, I’m not going to define that here) could get specific credit for that contribution – maybe as a Data Paper or E-Publication. To encourage people to not just download data, add to it, and then sit on the expanded dataset, embargoes could be put in place to allow people to add their data to the dataset but have the data protected for a limited period of time to allow that researcher to get first crack at the publications using that entry. It’d be really nice if people who use the database could easily download all the references for the data they used so it can be easily incorporated into a literature cited section. The central database could get credit (let’s face it, it needs to be able to justify the funding that such an endeavor would require) by having people register papers published using data from the database. They could then keep track of numbers of pubs and citations to those pubs to help track the database’s impact.
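To make a couple of those wish-list items concrete, here is a minimal sketch in Python of how embargoes and a downloadable reference list might work. Everything in it is an assumption for illustration: the field names (`embargo_until`, `reference`), the helper functions, and the records themselves are invented and do not describe any existing database.

```python
from datetime import date

# Hypothetical contributed records; species, values, references, and dates
# are all invented for illustration.
records = [
    {"species": "sp_A", "trait": "body_mass_g", "value": 42.0,
     "reference": "Author A 2001", "embargo_until": None},
    {"species": "sp_B", "trait": "body_mass_g", "value": 17.5,
     "reference": "Author B 2015", "embargo_until": date(2017, 1, 1)},
    {"species": "sp_C", "trait": "body_mass_g", "value": 88.0,
     "reference": "Author A 2001", "embargo_until": None},
]


def available_records(records, today):
    """Records with no embargo, or whose embargo has expired by `today`."""
    return [r for r in records
            if r["embargo_until"] is None or r["embargo_until"] <= today]


def reference_list(records):
    """Sorted, de-duplicated references for a set of downloaded records."""
    return sorted({r["reference"] for r in records})


# A user downloading data in mid-2016 does not see the embargoed record,
# and gets a ready-made list of sources to cite for what they did download.
download = available_records(records, today=date(2016, 6, 1))
print(reference_list(download))
```

The point of the sketch is only that both features fall out naturally once each record carries its own source and its own embargo date.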
Right about now, my Paleo brethren may be thinking “this sounds suspiciously familiar”. I’ve pretty much lifted this list right off of the Paleobiology Database website (https://paleobiodb.org/#/faq). While ecologists have been running our every-database-for-itself experiment on Trait Databases, the Paleobiologists have been experimenting with collaborative open databases for fossil records. I’m an outsider, so I don’t really know how the database is perceived within the paleo community, but from the outside I have been a big fan of the database, the work that has emerged from its existence, and the community that surrounds it. Which is why I’ve wondered if ecology could do something similar.
But if we’re going to do this, I think we need to copy something else from the Paleobiology Database: a focus on individual records. Currently, many trait databases focus on a species-level value; what is the average number of offspring per litter? Seed Mass? Average body size? This is a logical place to start building a database if many of the questions are focused on comparing central tendencies across species. But our understanding of traits and the questions we want to ask have evolved. Having any info is still better than no info, but often we need info on variability across individuals within a species or we want to know how the trait might vary with changes in the environment. For this, we need record-level data. By this, I mean that instead of pooling observations to obtain an average for a species, we now often want to know that the average litter size for a species at location X is 3 but 8 at location Y. For some species, traits are especially sensitive to temperature or some other environmental variable – so knowing if the body size was measured at 28C or 32C can be important. This data could then be summarized in whatever way the user needed (species-averages, region-specific averages, etc). This, of course, is the hard part, because while we have an increasing number of trait compilations, they have either jettisoned the record information, or little of the record info is associated with the datapoint except maybe the citation name (I say this knowing I’m guilty of this). It also involves doing some form of georeferencing if we want the location info to be useable (like they’ve been doing for museum records). This means we would need to basically uncompile the compilations – find the original citations, extract as much info as we can from them, and then re-enter them as part of a more sophisticated database. This is an extraordinary amount of work that (to be clear) I am not volunteering for.
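As a minimal sketch of why record-level data is more flexible, the Python below stores individual observations and summarizes them on demand. The records, species codes, and the `summarize` helper are all made up for illustration; the litter sizes of 3 at location X and 8 at location Y simply echo the example above.

```python
from collections import defaultdict
from statistics import mean

# Made-up record-level observations (hypothetical species, locations, values).
records = [
    {"species": "sp_A", "location": "X", "litter_size": 3},
    {"species": "sp_A", "location": "X", "litter_size": 3},
    {"species": "sp_A", "location": "Y", "litter_size": 8},
    {"species": "sp_B", "location": "X", "litter_size": 5},
]


def summarize(records, trait, by):
    """Average `trait` within groups defined by the fields named in `by`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in by)
        groups[key].append(rec[trait])
    return {key: mean(values) for key, values in groups.items()}


# The same records support both a species-level average...
species_means = summarize(records, "litter_size", by=["species"])
# ...and a finer species-by-location summary.
site_means = summarize(records, "litter_size", by=["species", "location"])
print(species_means)
print(site_means)
```

Going the other direction is not possible: once only the species average is stored, the difference between locations X and Y cannot be recovered.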
There are undoubtedly some in the trait community who are about to explode because they’ve been thinking “but we’re doing what you are talking about!”. There are indeed already some bigger initiatives out there (AnAge, FishBase, TRY) but they are either not community-based (i.e. run by a closed group), taxon-centric, or a nightmare of open and closed policies that make extracting data needlessly burdensome, or some unfortunate combo of the above. The one that seems closest to the Paleobiology Database model is TraitBank at the Encyclopedia of Life. Its goal, however, is different from the record-based trait database that I outlined above. Its goal is to have a webpage (and trait data) for every species on the planet, so this still seems to be a species average approach. As I mentioned before, some info is better than no info, so this alone would be a huge benefit to trait research, but still carries the restrictions of species-average values. On the plus side, data in the database is available for everyone to use and each data entry has the specific reference listed with it. But I don’t think it’s had broad buy-in from the trait community. TraitBank only lists 50 data sources and 327 “content partners” (websites/databases that have agreed to share their data via Encyclopedia of Life pages). Admittedly, these sources are some of the biggest data aggregations around, but it’s inconceivable that they cover the wide array of trait info for all of life. Without broad buy-in from the trait community, both using it for research and contributing their data to it, I don’t see this working in the way I’ve outlined above.
So where does this leave us? Well, things are currently in a muddle with respect to trait data, but there’s also tremendous opportunity for someone who can envision the type of database the field needs, sell broad swaths of the trait data community on its importance, and figure out how to build both the database and the community to support and use it. This may involve better community buy-in with TraitBank and/or some new initiative working on a record-level product that would allow a finer level of question to be asked. The question is how this happens, and whether there is enough will in the trait community to give up on the current idiosyncratic ad hoc approach and contribute to something with broad trait and taxonomic coverage and an open data policy.