From how we do science to publishing practices to the sociology of science, there isn’t an aspect of the scientific endeavor that isn’t in flux right now. Long-term readers of Jabberwocky know that understanding how the scientific endeavor is changing and figuring out how to maximize the good for science and minimize the bad is a bit of an obsession for us. Ethan has been a tireless proponent (or “a member of the radical fringe” as one former ESA president called him) for changes in scientific publishing and reproducibility. For me, the issue close to my heart is data availability. For me, this is a “for the good of science” issue. By definition, science relies on data. If data is stuck in a drawer and no one knows about it or it becomes technologically inaccessible (I’m looking at you 5 1/4” floppy disk) then it effectively does not exist for the scientific endeavor. At best it is knowledge that needs time and resources (neither of which we have in abundance) to reacquire. At worst that is knowledge lost forever.
But publishing one’s data for the world to use is not the ecology way – in part because extracting data from nature is hard. Much of the resistance is because scientists are afraid they will lose out on credit for that hard work. If they are the one to publish on that data, they get the credit. If someone else does, the credit is less. Regularly, I see a journal article, or a tweet, or a blog post worried about the increasing push to make data publicly available. Most of these just make me sad for science, but one in particular has haunted me for a while because it focused on something near and dear to my heart: Long-Term Data. A paper published by Mills et al. in TREE last year argued that sharing long-term data will kill off long-term studies. This paper conducted a survey of 73 Principal Investigators running long-term ecology projects. Almost all said they were in favor of sharing data “with the agreement or involvement of the PI”. Only 8% were in favor of “open-access data archiving”1. 91% supported data-sharing when there were clear rules for how/when the data would be shared. Suggestions for rules included “(i) coauthorship or at least acknowledgment, depending on the level of PI involvement; (ii) no overlap with current projects, particularly projects conducted by students or postdoctoral fellows; and (iii) an agreement that the data go no further than the person to whom they are entrusted.”
My colleagues were so against open-access data archiving that many said they would rather submit their science to a less high profile journal if publishing in a more high profile journal required them to archive their data2. The paper argues that this type of decision making will result in less impactful papers and harm the careers and funding opportunities for scientists studying long-term ecology. Fears were expressed that flawed science would be produced without the input of the PIs, that being scooped would damage the careers of their trainees, concern that time would get wasted due to redundant analyses being conducted as multiple labs do the same analyses, a reduction in the number of long-term studies due to lower incentives for conducting this type of science, less collaboration, and lost opportunities to obtain new funding because research is being done by other groups. You get the idea. Sharing data results in lost opportunities to author papers with cascading consequences.
Having just published the next installment of the Portal Project long-term data and begun our on-line data streaming experiment, it seems like an ideal time to talk about my experiences and concerns with sharing long-term data. After all, unlike many people who have expressed these fears, my raw data3 has been openly available since 2009. How calamitous has this experience been for me and my students?
Since the database was published in 2009, it has been cited 16 times. That’s about 2ish papers a year, not exactly an impressive flurry of activity – though you could argue that it would still be a significant boost to my productivity. But the picture changes when you look at how exactly the data are being used. Of those 16 citations, 4 cite the Data Paper to support statements about long-term data being published, 4 use the data as one of many datapoints as part of a meta-analysis/macroecological study, 3 use the data to plot a data-based example of their idea/concept/tool (i.e. the data is not used as part of an analysis that is being interpreted), 3 use the data to ask a scientific question focused on the field site, 1 cites the data paper for a statement in the metadata about the importance of experimental manipulations and 1 cites it for reasons I cannot ascertain because I couldn’t access the paper but the abstract makes it clear they are not analyzing our data. No one is going to add me a co-author to cite the existence of long-term data, cite statements made in the metadata, or make an example figure, so we’re down to 7 papers that I “lost” by publishing the data. But it gets even worse4. I am already a co-author on the 3 papers that focus on the site. So, now we’re down to 4 meta-analysis/macroecological studies. As someone who conducts that type of research I can tell you that I would only need to include someone as an author to get access to their data if I desperately needed their particular data for some reason (i.e. location, taxa, etc) or if I can’t get enough sites to make a robust conclusion otherwise. There is a lot of data available in the world through a variety of sources (government, literature, etc). Given the number of studies used in those 4 papers, if I had demanded authorship for use of my data, my data would probably not have been included.
Final tally: We published one of the few (and among the longest-term) long-term datasets on climate, plants, and consumers for a single site in existence. This dataset even includes a long-term experimental manipulation (the ‘gold standard’ of ecology). That data has been openly available with no limitations for 7 years. I cannot yet point to an instance where someone used the data in a way that has cost us a paper – either through scooping us or because if the data had not been available they would have been forced to collaborate with me.
In fairness, I don’t expect that to be true forever, but right now our challenge isn’t how to avoid being scooped, it’s how do we get anyone to use our data!!! When I talk about this data online or give talks, invariably people tell me “you are so lucky to have a data set like that”. My response: The entire world has a data set exactly like this because we published it! Not one of those people has published anything with it.
My experience is not unique. A response paper was published this year by Simon Evans examining this lost opportunity cost of making long-term data publicly available. Evans combed through Dryad (the data archival site for many ecology and evolution journals) to see how many times long-term data archived on the site had been used. Using a 5 year or more definition for long-term data, there were 67 publicly accessible datasets on Dryad.5 How often had these data been used? Using citations to the data package, examining citations to the original paper associated with the data package, and contacting the data authors to see if they knew of instances of their data being used, Evans found that there were 0 examples of deposited data being reused by investigators not authors on the original study.5
Most people I know who are looking for data often forget about Dryad, so maybe Dryad just hasn’t been ‘discovered’. I would be interested to know how Evans’ result compares to data being downloaded from Ecological Archives. But given our experience with the open Portal Project data on Ecological Archives, I suspect differences in long-term data usage between different repositories is small.
So, currently there is no evidence that publishing long-term data results in the negative impacts based on the Mills et al paper. Does it have a positive impact? For long-term data, I’m currently unclear because so few people seem to be using that type of data. But I have published some macroecological data sets in the past (Ernest 2003, Smith et al 2003, Myhrvold et al 2015) and there have definitely been positives for me. No, it has not resulted in papers for me, but I have also not been scooped on anything I was actively working on or even seriously interested in pursuing. But they have resulted in a fair number of citations (397 to date via Google Scholar), they contribute to my h-index (which universities use to judge me), and have definitely contributed to my name recognition in that area of ecology. (I have been tongue tied on more than one occasion when a big name walked up to me and thanked me for publishing my 2003 mammal life history dataset). No, these aren’t publications, but name recognition is ever harder to obtain in the increasingly crowded field of science, and citation and impact metrics (for better or worse) are increasingly a part of the assessment of scientists. So yes, I believe that publishing datasets in general has been a net positive for me.
Finally, I can also say that NSF is watching. At the end of my previous NSF grant supporting the research at the field site, my program officer was in my ear reminding me that as part of that grant I needed to publish my data. I don’t know how other people feel about this, but I feel that a happy NSF is a positive.
So, in my experience, publishing my long-term data has not resulted in the grand implosion of my research group. If anything, I think the relative dearth of activity using the long-term data – especially in comparison to the macroecological datasets – suggests that very few people are actually using long-term data. To me, this lack of engagement is much more dangerous for the continuation of funding for long-term ecology than the nebulous fears of open data. If people don’t actively see how important this type of data is, why would they ever recommend for it to be funded? Why prioritize funding long-term data collection – a data type most ecologists have never used and don’t understand – over an experiment which most ecologists do and understand. We need more advocates for long-term ecology and I don’t believe you can do that by tightly controlling access so only a lucky few have access to it. So if you’re wondering why we now stream the data nearly live on GitHub, or why we make the data available to be used as a teaching database for DataCarpentry, that’s why. Long-term datasets – and a large body of scientists who understand how to work with them – are going to be important in tackling questions about how and why ecosystems change through time (not just in the past but into the future). This makes increasing the number of people working with long-term data a win for science – and in the long-run I believe it will be a win for me and others who invest so much blood, sweat and tears into generating long-term data.
1 I have never had quantitative evidence before that I was not “normal”. My reaction was to give myself a pat on the back, but I couldn’t figure out if the pat was consolatory or congratulatory.
2 This language is drawn from the original paper and does not reflect my opinions on “high” vs “low” impact journals.
3 In the spirit of complete disclosure, this data is not exactly mine. I’ve collected a lot of it since 1995, but the site was started by Jim Brown, Diane Davidson, and Jim Reichman in 1977 and many people have been involved in collecting the site’s data. But I argue those expressed fears still apply to my case because when Tom Valone and I become the PIs of the project in the early 2000’s we could have opted to sit on this treasure trove of data instead of publishing it.
4 Worse depends on your point of view, of course. If you don’t want people to use your data, this is good news. I use worse here in the sense that this is bad news for the argument that publishing your data will cause you to lose papers. It is also worse from the perspective that we published this data and no one is using it.
5 72 data packages in total were identified but some of these were under embargo and not yet available
6 This makes Portal look like a rock star! Our raw data (i.e. information had to be extracted using the raw data and could not be obtained from summary statistics in one of our papers) were used in 4 meta-analysis/macroecology papers. That is literally infinitely more used than those other long-term data J