From how we do science to publishing practices to the sociology of science, there isn’t an aspect of the scientific endeavor that isn’t in flux right now. Long-term readers of Jabberwocky know that understanding how the scientific endeavor is changing and figuring out how to maximize the good for science and minimize the bad is a bit of an obsession for us. Ethan has been a tireless proponent (or “a member of the radical fringe” as one former ESA president called him) for changes in scientific publishing and reproducibility. For me, the issue close to my heart is data availability: it is a “for the good of science” issue. By definition, science relies on data. If data is stuck in a drawer and no one knows about it or it becomes technologically inaccessible (I’m looking at you, 5 1/4” floppy disk) then it effectively does not exist for the scientific endeavor. At best it is knowledge that needs time and resources (neither of which we have in abundance) to reacquire. At worst that is knowledge lost forever.
But publishing one’s data for the world to use is not the ecology way – in part because extracting data from nature is hard. Much of the resistance is because scientists are afraid they will lose out on credit for that hard work. If they are the ones to publish on that data, they get the credit. If someone else does, the credit is less. Regularly, I see a journal article, or a tweet, or a blog post worried about the increasing push to make data publicly available. Most of these just make me sad for science, but one in particular has haunted me for a while because it focused on something near and dear to my heart: Long-Term Data. A paper published by Mills et al. in TREE last year argued that sharing long-term data will kill off long-term studies. This paper conducted a survey of 73 Principal Investigators running long-term ecology projects. Almost all said they were in favor of sharing data “with the agreement or involvement of the PI”. Only 8% were in favor of “open-access data archiving”1. 91% supported data-sharing when there were clear rules for how/when the data would be shared. Suggestions for rules included “(i) coauthorship or at least acknowledgment, depending on the level of PI involvement; (ii) no overlap with current projects, particularly projects conducted by students or postdoctoral fellows; and (iii) an agreement that the data go no further than the person to whom they are entrusted.”
My colleagues were so against open-access data archiving that many said they would rather submit their science to a less high profile journal if publishing in a more high profile journal required them to archive their data2. The paper argues that this type of decision making will result in less impactful papers and harm the careers and funding opportunities for scientists studying long-term ecology. Fears were expressed that flawed science would be produced without the input of the PIs, that being scooped would damage the careers of their trainees, that time would be wasted as multiple labs conduct the same analyses, that the number of long-term studies would decline due to lower incentives for conducting this type of science, that there would be less collaboration, and that opportunities to obtain new funding would be lost because research is being done by other groups. You get the idea. Sharing data results in lost opportunities to author papers with cascading consequences.
Having just published the next installment of the Portal Project long-term data and begun our on-line data streaming experiment, it seems like an ideal time to talk about my experiences and concerns with sharing long-term data. After all, unlike many people who have expressed these fears, my raw data3 has been openly available since 2009. How calamitous has this experience been for me and my students?
Since the database was published in 2009, it has been cited 16 times. That’s about 2ish papers a year, not exactly an impressive flurry of activity – though you could argue that it would still be a significant boost to my productivity. But the picture changes when you look at how exactly the data are being used. Of those 16 citations, 4 cite the Data Paper to support statements about long-term data being published, 4 use the data as one of many datapoints as part of a meta-analysis/macroecological study, 3 use the data to plot a data-based example of their idea/concept/tool (i.e. the data is not used as part of an analysis that is being interpreted), 3 use the data to ask a scientific question focused on the field site, 1 cites the data paper for a statement in the metadata about the importance of experimental manipulations and 1 cites it for reasons I cannot ascertain because I couldn’t access the paper but the abstract makes it clear they are not analyzing our data. No one is going to add me as a co-author to cite the existence of long-term data, cite statements made in the metadata, or make an example figure, so we’re down to 7 papers that I “lost” by publishing the data. But it gets even worse4. I am already a co-author on the 3 papers that focus on the site. So, now we’re down to 4 meta-analysis/macroecological studies. As someone who conducts that type of research I can tell you that I would only need to include someone as an author to get access to their data if I desperately needed their particular data for some reason (e.g. location, taxa, etc) or if I can’t get enough sites to make a robust conclusion otherwise. There is a lot of data available in the world through a variety of sources (government, literature, etc). Given the number of studies used in those 4 papers, if I had demanded authorship for use of my data, my data would probably not have been included.
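The tally above can be checked with a short Python sketch. The counts are the ones given in the paragraph; the category labels and variable names are shorthand invented here for illustration.

```python
# Citation counts for the 2009 data paper, as listed in the paragraph above.
# The dictionary keys are illustrative shorthand, not terms from any database.
citations = {
    "cites that long-term data were published": 4,
    "meta-analysis / macroecological study": 4,
    "example figure only (no interpreted analysis)": 3,
    "question focused on the field site": 3,
    "cites a statement in the metadata": 1,
    "reason unknown (paper inaccessible)": 1,
}

total = sum(citations.values())

# Only papers that actually analyze the data could count as "lost" papers.
potentially_lost = (
    citations["meta-analysis / macroecological study"]
    + citations["question focused on the field site"]
)

# The data author is already a co-author on all site-focused papers.
already_coauthor = citations["question focused on the field site"]
remaining = potentially_lost - already_coauthor

print(f"total citations: {total}")
print(f"papers that analyze the data: {potentially_lost}")
print(f"of those, without the data author as co-author: {remaining}")
```

Writing it out this way makes the narrowing explicit: each step removes a category where authorship was never on the table or was already in hand.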
Final tally: We published one of the few (and among the longest-term) long-term datasets on climate, plants, and consumers for a single site in existence. This dataset even includes a long-term experimental manipulation (the ‘gold standard’ of ecology). That data has been openly available with no limitations for 7 years. I cannot yet point to an instance where someone used the data in a way that has cost us a paper – either through scooping us or because if the data had not been available they would have been forced to collaborate with me.
In fairness, I don’t expect that to be true forever, but right now our challenge isn’t how to avoid being scooped, it’s how do we get anyone to use our data!!! When I talk about this data online or give talks, invariably people tell me “you are so lucky to have a data set like that”. My response: The entire world has a data set exactly like this because we published it! Not one of those people has published anything with it.
My experience is not unique. A response paper was published this year by Simon Evans examining this lost opportunity cost of making long-term data publicly available. Evans combed through Dryad (the data archival site for many ecology and evolution journals) to see how many times long-term data archived on the site had been used. Using a 5 year or more definition for long-term data, there were 67 publicly accessible datasets on Dryad.5 How often had these data been used? Using citations to the data package, examining citations to the original paper associated with the data package, and contacting the data authors to see if they knew of instances of their data being used, Evans found that there were 0 examples of deposited data being reused by investigators not authors on the original study.6
Most people I know who are looking for data often forget about Dryad, so maybe Dryad just hasn’t been ‘discovered’. I would be interested to know how Evans’ result compares to data being downloaded from Ecological Archives. But given our experience with the open Portal Project data on Ecological Archives, I suspect differences in long-term data usage between different repositories are small.
So, currently there is no evidence that publishing long-term data results in the negative impacts predicted by the Mills et al. paper. Does it have a positive impact? For long-term data, I’m currently unclear because so few people seem to be using that type of data. But I have published some macroecological data sets in the past (Ernest 2003, Smith et al 2003, Myhrvold et al 2015) and there have definitely been positives for me. No, it has not resulted in papers for me, but I have also not been scooped on anything I was actively working on or even seriously interested in pursuing. But they have resulted in a fair number of citations (397 to date via Google Scholar), they contribute to my h-index (which universities use to judge me), and have definitely contributed to my name recognition in that area of ecology. (I have been tongue tied on more than one occasion when a big name walked up to me and thanked me for publishing my 2003 mammal life history dataset). No, these aren’t publications, but name recognition is ever harder to obtain in the increasingly crowded field of science, and citation and impact metrics (for better or worse) are increasingly a part of the assessment of scientists. So yes, I believe that publishing datasets in general has been a net positive for me.
Finally, I can also say that NSF is watching. At the end of my previous NSF grant supporting the research at the field site, my program officer was in my ear reminding me that as part of that grant I needed to publish my data. I don’t know how other people feel about this, but I feel that a happy NSF is a positive.
So, in my experience, publishing my long-term data has not resulted in the grand implosion of my research group. If anything, I think the relative dearth of activity using the long-term data – especially in comparison to the macroecological datasets – suggests that very few people are actually using long-term data. To me, this lack of engagement is much more dangerous for the continuation of funding for long-term ecology than the nebulous fears of open data. If people don’t actively see how important this type of data is, why would they ever recommend for it to be funded? Why prioritize funding long-term data collection – a data type most ecologists have never used and don’t understand – over an experiment which most ecologists do and understand? We need more advocates for long-term ecology and I don’t believe you can do that by tightly controlling access so only a lucky few have access to it. So if you’re wondering why we now stream the data nearly live on GitHub, or why we make the data available to be used as a teaching database for DataCarpentry, that’s why. Long-term datasets – and a large body of scientists who understand how to work with them – are going to be important in tackling questions about how and why ecosystems change through time (not just in the past but into the future). This makes increasing the number of people working with long-term data a win for science – and in the long-run I believe it will be a win for me and others who invest so much blood, sweat and tears into generating long-term data.
1 I have never had quantitative evidence before that I was not “normal”. My reaction was to give myself a pat on the back, but I couldn’t figure out if the pat was consolatory or congratulatory.
2 This language is drawn from the original paper and does not reflect my opinions on “high” vs “low” impact journals.
3 In the spirit of complete disclosure, this data is not exactly mine. I’ve collected a lot of it since 1995, but the site was started by Jim Brown, Diane Davidson, and Jim Reichman in 1977 and many people have been involved in collecting the site’s data. But I argue those expressed fears still apply to my case because when Tom Valone and I became the PIs of the project in the early 2000’s we could have opted to sit on this treasure trove of data instead of publishing it.
4 Worse depends on your point of view, of course. If you don’t want people to use your data, this is good news. I use worse here in the sense that this is bad news for the argument that publishing your data will cause you to lose papers. It is also worse from the perspective that we published this data and no one is using it.
5 72 data packages in total were identified but some of these were under embargo and not yet available.
6 This makes Portal look like a rock star! Our raw data (i.e. information had to be extracted using the raw data and could not be obtained from summary statistics in one of our papers) were used in 4 meta-analysis/macroecology papers. That is literally infinitely more used than those other long-term data :)
I’m not surprised to hear that basically nobody’s made use of the Portal data, whereas lots of people have made use of the comparative datasets you’ve published. It’s my impression that people who are looking to download and analyze data collected by others are mostly looking to do comparative analyses of some sort. Insofar as the Portal dataset is unique or nearly so, there’s little to compare it to. And it only covers one site and doesn’t cover a gazillion species, so you can’t do much comparative analysis just using the Portal data. So I don’t know that the lack of citations to the Portal dataset reflects lack of interest in long-term data. Plenty of people are interested in long-term data–think for instance of the BBS data, which I’m sure are very heavily used. But the BBS data come from many species and sites, facilitating comparative analyses.
Plus, the distribution of attention paid to anything is highly skewed. Few papers get highly cited, few books sell millions of copies, few movies become hits, few blogs have many readers, etc. Most shared datasets are going to get downloaded or cited approximately zero times, whereas a few will get downloaded or cited approximately a zillion times. So I don’t know that I’d read much into the Portal dataset not getting used much.
I leave it to others to ask if we’re missing opportunities or otherwise behaving suboptimally if we use shared data only for comparative analyses…
Sorry you felt like the post was me needing reassurance because my data set is unloved. That wasn’t really the point (I feel very good about my data). The point is exactly what I said it was: the fears expressed by other long-term data generators about open sharing of long-term data do not seem to be warranted. In that light, I think you make a bunch of points that are very important and had occurred to me as well (I have toyed with a post titled: your data (on its own) isn’t as valuable as you think it is, but I can’t figure out how to write it in a positive way). I would agree with you completely that people who want to use existing data are mainly those who want to do comparative work. I think that comes through in the list of how the Portal dataset has been used. No one but us has done a study using the site on its own. I think that’s because to do a study focused on a single site, you may need to have a lot of knowledge of the site (depending upon the question and the results) and that’s a higher activation energy for working with the data. But to do comparative work, the idea is that all the idiosyncrasies of the sites come out in the wash. So you don’t necessarily need that site specific insight.
I think your point that people who do comparative work would prefer an existing compilation (or dataset that already has a suite of sites like BBS) rather than piecing it together one time series at a time is also very true. Piecing together a bunch of sites that are comparable, one at a time from the literature, is time consuming (I know. I’ve done it). If we want to make it easier for people to do comparative long-term ecology, we need a database that has gathered and ‘scrubbed’ the data so that it is roughly comparable. (and fortunately, I know some people are doing this!). But on its own, any particular time series is less likely to be used because of the activation energy needed to work with it. Hence my unwritten blog post: your data is not as valuable (on its own) as you think it is. I think most people overestimate how many people are salivating to analyze their data and thus overestimate the ‘damage’ sharing it might cause. Which was a main theme of my post, though you came at it from a different direction.
Finally, long-term ecology in general is a less used approach for studying ecology than say….experiments (I think even you wouldn’t argue with me on that 🙂 ) If few people are studying long-term ecology then even fewer people are going to be interested in doing comparative work on it. I think it would be interesting to redo the Evans analysis but with short-term experimental data and see if there is a difference in usage. Do you also see 0 use of the data from those studies?
> Plenty of people are interested in long-term data–think for instance of the BBS data, which I’m sure are very heavily used.
I would say that while the BBS data is heavily used, the time-series aspect of it is used far less than the spatial aspects of the data.
It’s always great to see real data (even anecdata) on this topic instead of the sort of unsubstantiated sweeping claims that are made so often.
I increasingly think the notion of a short (1-2 year) embargo is a compromise that could help build momentum toward sharing data. Apparently this is the norm in astronomy – you have to put your data in a repository when you publish, but you can put an embargo on it. Whenever I ask people whether they really think they have lots of papers and questions that they’re going to start 3 years down the road, they admit the answer is no.
While I fully support and lean to full and immediate data openness myself (but also recognize this is easy for me to say as I’m not a primary data collector), it seems like embargoed data could be a halfway house that could build momentum towards open data. I’m curious what you think about this?
Also, I hear a lot more often these days “I don’t want somebody interpreting my data without me involved”. The holes in this argument seem big to me. It is true some people will do careful science with your data and some people will do sloppy science. But that doesn’t really change anything. What are your thoughts on what this argument is really saying and how to respond?
“Finally, long-term ecology in general is a less used approach for studying ecology than say….experiments (I think even you wouldn’t argue with me on that)”
Well, some of us work in microcosms precisely so that we don’t have to choose between doing experiments and doing long-term work. 🙂
“your data is not as valuable (on its own) as you think it is.”
Absolutely. I confess that I find datasharing requirements slightly annoying, because I know my own data isn’t that valuable to anyone but me. Nobody besides me is likely to use the data I collect. (To be clear, I’m not annoyed enough by datasharing requirements to think they shouldn’t exist…)
I predict that the strongest datasharing advocates would be as happy with embargoes on data as the strongest open access publishing advocates are with making papers in subscription-based journals open access after a 6- or 12-month embargo period.
> …it seems like embargoed data could be a halfway house that could build momentum towards open data. I’m curious what you think about this?
I find embargoes curious. Do they actually work to help prevent scooping while you finish up follow-on projects or are they just a psychological salve? Say I submit a paper and I have a follow-up project I want to do. For me, either those projects have already been initiated (maybe early phase, but still they are on the active docket) or they are on the “I’ll do some day” pile. If they are on the “I’ll do some day” pile, the truth is I probably won’t ever get to them, much less in 1-3 years. If it’s already in progress then I’m ahead of anyone else who might want to do that analysis when my original paper comes out (typically what, 6 months to a year or more after I first submit it?). So I don’t think embargoes protect much and just slow things down, BUT in the spirit of bipartisan cooperation, I would gladly sign on to a bill that required depositing data but allowed a 1-3 year data embargo.
> I hear a lot more often these days “I don’t want somebody interpreting my data without me involved”…What are your thoughts on what this argument is really saying and how to respond?
Yes, I’ve run across this one too, and unlike the embargo one I’m not sure there’s a great way to diplomatically handle that one. When you invest so many years into a site, you develop a strong relationship with it. I understand feeling uncomfortable with the idea of someone else taking your study site out on a date. Would I initially feel unsettled if I opened up a table of contents and saw someone had written a Portal-focused paper without ever contacting me or Tom or Jim? Yes. Would I be aggravated if it was stupid/flawed? Yes. Should I have had veto power over it? No. After working at the site for so many years, I know it better than anyone but Tom Valone and Jim Brown. But because of that I also come with preconceptions and biases. I think I know how the system works – even if we don’t have definitive proof that it actually works that way. I also have some level of investment in the results that have been generated before and may be less open to new analyses that suggest I was wrong (I hope this isn’t the case, but I’m human). So there are pros and cons for having a site expert involved. Having fresh eyes view the data, without that baggage, allows for new interpretations, new ways of thinking about the data, and new concepts to get a fair shake. The con is that newcomers lack the history (both what has been found scientifically before and often the natural history) of the site and may do things that don’t make sense (combine controls and experimental plots for a question that should not be addressed that way, include species from different feeding guilds, etc). So, my feeling is that while I understand the unease, it is best that site experts don’t have veto power over the type of science that gets produced. Having said this, my personal philosophy is that if I am going to focus exclusively on analyzing one site (and it’s not one I’ve worked at) that to do the best science, I should interact with a site expert to make sure I’m not off base.
Maybe they have useful comments on what data to include or exclude. Maybe they have useful advice on interpretation. Maybe they think it’s a great analysis. Maybe they hate what you’ve done. Regardless, I think you get useful feedback from the exchange. I have written one paper that focused exclusively on a site that was not mine (i.e. was not a comparative analysis) and we did interact extensively with a site expert on it. The exchange definitely helped us dot our i’s and cross our t’s. Not sure any of this helps you with how to respond the next time someone says this to you, though!
“I know my own data isn’t that valuable to anyone but me”
This is a very common view, but I think it’s actually wrong in one important way. I would rephrase this as “my data, *in isolation*, isn’t that valuable to anyone but me”. The “*in isolation*” part is why data sharing requirements are so important. The kind of work I do often involves compiling data from dozens or hundreds of sources to either do some kind of a meta-analysis or combine information from lots of different sources to produce an integrative model. If I can readily access the data I can immediately get to work on the hard work of getting it all integrated. If instead, the starting point is to email 25 people and try to negotiate access to each of the individual datasets, the science becomes much more difficult to accomplish. We did the latter for a bunch of the data in Xiao et al. 2015 and I’d estimate that between Xiao and me we had >100 hours just in email discussions about getting access to and use restrictions on data.
“I would rephrase this as “my data, *in isolation*, isn’t that valuable to anyone but me”.”
Good point. Though when it comes to my data specifically, it’s hard for me to imagine it being valuable to anyone else even if combined with other data. Just because there *are* no other data to combine with mine; I mostly do pretty unique experiments. Hard for me to imagine enough other people ever doing enough sufficiently-similar experiments that a meta-analysis would ever be possible, and hard for me to imagine my data ever turning out to have any other use. But perhaps this just represents a failure of imagination on my part.
And there are exceptions: one BEF experiment of mine has been used in a couple of (very small) meta-analyses. Though that’s kind of the exception that proves the rule, because it’s a rare case in which I ran an experiment in microcosms that others had also run in other systems.
I agree that there’s some value in embargoes as a method for overcoming (the generally unjustified) fear of getting scooped. One year is something I’m generally comfortable with. That said, in terms of Morgan’s argument that the need for attention and engagement is much greater than any risks, I think that choosing to embargo data that you collect significantly decreases the likelihood that it will end up being used, since folks have short memories.
Morgan’s response to “I don’t want somebody interpreting my data without me involved” is really excellent. Coming from the non-field side I also think it’s interesting that I hear the same argument made, not infrequently, by statisticians and software developers. Basically “Researchers shouldn’t use these advanced techniques without an expert involved”. I feel the same way about both sets of statements. Our goal should be to make working with both data and analysis tools as accessible as possible through good documentation and sensible defaults. People will still misuse them, but then we try to solve that through a combination of education and improving the data/tools.
I’m sure there’s data out there that isn’t all that useful for answering questions other than the one it was designed to answer. However, I’ve seen enough examples where I would have guessed that to be true, but the data turned out to have secondary uses, that I think it’s hard to predict in advance. For example, I think that Supp & Ernest 2014 is a nice example of combining a bunch of studies that initially seemed unlikely to be reusable. They were actually constrained in their analyses by the lack of availability of raw data for many of the papers. This is why I think there’s value in archiving everything so that it’s there if and when we end up finding something to do with it. Plus, just think, when your experiments become world famous then undergraduates can re-analyze your data as part of their ecology curriculum :).
So what do you think of the argument (which I seem to recall Margaret Kosmala making) that datasharing requirements contribute to “death by a thousand cuts”? The argument, as I understand it, is that increasingly, scientists are asked/expected/obliged to do things that are unlikely to be of much benefit to them personally, on the grounds that they’ll be beneficial to science as a whole, or perhaps that there’s merely some non-zero chance that they’ll be beneficial to science as a whole. Taken singly, the things we’re being asked to do aren’t very costly to us as individuals–it doesn’t take *that* much time for me to prepare a single dataset for uploading to Dryad, for instance. But collectively, the costs to any given individual, and to scientists collectively, add up.
One could of course argue that this is precisely the case *for* requiring people to do certain things. Few people will willingly do something that’s even slightly costly to themselves on the small chance that it might turn out to be beneficial to science as a whole, or on the even smaller chance it might benefit them personally. But the question still remains of how you weigh costs and benefits, to both individuals and to science as a whole.
As I said, I don’t personally subscribe to this “death by a thousand cuts” argument against datasharing requirements. If only because I don’t track or guard my time sufficiently closely or jealously to be all that bothered by small time obligations, individually or in aggregate. But I think it’s an argument worth taking seriously, while also thinking it’s a difficult argument to evaluate. How on earth could you measure costs and benefits here in any vaguely-objective way?
One response to this is of course to try to find a way to give people benefits when their shared data do end up getting used. Another response is to try to change professional norms so that nobody thinks of datasharing in terms of benefits and costs. People just do it because it’s the right thing to do, or maybe without even thinking about why. Like how nobody thinks of “not falsifying data” in terms of benefits and costs–falsifying data is just something you shouldn’t do, and it wouldn’t ever occur to most people that falsifying data is even an option. I don’t have any strong views on these two responses–the whole topic isn’t one I’ve thought much about, honestly–except to note that they’re in tension with one another.
A third response is to grant the argument and follow up by saying that those who benefit most from using data collected by others should pay the costs of obtaining it. For instance, one could argue that since you and Xiao benefitted a lot from publishing that (very nice!) paper in a high-impact venue, it’s right and proper that you had to pay the (considerable!) costs of chasing down the data by doing 100 hours worth of emailing. Rather than the costs of getting you those data in a usable form being “hidden” by being redistributed and split among a whole bunch of people who aren’t likely to benefit personally. Again, I don’t necessarily buy that argument myself, in part because the total costs involved might well go down when they’re redistributed and split. But I don’t think it’s a crazy argument. It’s the same basic logic as, say, that old PubCreds idea Owen Petchey and I had to ensure that people are willing to do peer reviews. It’s the logic that says you resolve tragedies of the commons by “privatizing” the commons.
Shortish answers, which maybe I’ll expand to a post level at some point:
1. I think that data is one of the most fundamental units of science and that as a result funders and journals are right to require them as part of the outputs of science. Making data available lets us reproduce analyses and debate conclusions, it lets us avoid multiple groups repeatedly spending grant dollars conducting similar or identical tasks, and it lets us stand on the shoulders of other scientists. Reasonable costs associated with making this happen are more than worth it.
2. The marginal cost of archiving is small if you’re managing data well in the first place. Arguments about how time consuming archiving is typically rest on the time it takes to get the data in shape to be shared, but I think that keeping data in good shape is actually beneficial to the individual researcher because it makes them more efficient in the long run. This is particularly true of PIs who have people leave their labs without completing projects. Keeping data well documented and structured is analogous to keeping a good lab notebook and documenting your code. It takes a little training (which is why I invest a lot of time and effort in that space), but it’s better for the individual, it’s better for the lab, and it’s better for science. The marginal cost of uploading already well managed data to Dryad is minimal and will continue to decrease.
3. In the US most grant programs explicitly require data management plans to ensure that the costs associated with data management and provision are covered. In the same way that I shouldn’t be funded to produce software that I won’t make openly available, or to write papers that I won’t publish, I shouldn’t be funded by taxpayer dollars to collect data that won’t be made broadly useful.
It is my opinion that a study of the valuation of particular kinds of ‘nearly universal’ and ‘very particular’ data is required. The utility of the valuation is manifold; but the key insights that would drive the study are that this data is frequently of sharply increased value when combined with other data that is very different, and of marginal increased value when combined with very similar data, costs a foreseeable and reproducible amount to obtain, and is of theoretically decaying value as it is mined for insights. So there are four major factors going into the valuation: the quantity and distribution of public data, the quantity and distribution of private data, the cost to obtain/re-obtain the data, and the time it has been possessed under public and private contexts.