Should you cite preprints in your papers, and should journals allow this? This is a topic that gets debated periodically. The most recent round of Twitter debate started last week when Martin Hunt pointed out that the journal Nucleic Acids Research wouldn’t allow him to cite preprints. A couple of days later I suggested that journals that don’t allow citing preprints are putting their authors at risk by forcing them not to cite relevant work. Roughly forty games of Sleeping Queens later (my kid is really into Sleeping Queens) I reopened Twitter and found a roiling debate over whether citing preprints was appropriate at all.
The basic argument against citing preprints is that they aren’t peer reviewed and that this could lead to the citation of bad work and the potential decay of science.
There are three reasons I disagree with this argument:
- We already cite lots of non-peer reviewed things in ecology
- Lots of fields already do this and they are doing just fine
- Responsibility for the citation lies with the citer
We already cite non-peer reviewed things in ecology
As Auriel Fournier, Stephen Heard, Michael Hoffman, Terry McGlynn and ATMoody pointed out, we already cite lots of things that aren’t peer reviewed, including government agency reports, white papers, and other “grey literature”.
We also cite lots of other really important non-peer reviewed things like data and software. We’ve been doing this for decades. Ecology hasn’t become polluted with pseudoscience. It will all be OK.
Lots of other fields already do this
One of the things I find amusing/exhausting about biologists debating preprints is ignorance of their history and use in other fields. It’s a bit like debating the name of an actor for two hours when you could easily look it up on Google.
In this particular case (as Eric Pedersen pointed out) we know that citation of preprints isn’t going to cause problems for the field because it hasn’t caused issues in other fields and has almost invariably become standard practice in fields that use preprints. Unless you think Physics and Math are having real issues it’s difficult to argue that this is a meaningful problem. Just ask a physicist.
You are responsible for your citations
Why hasn’t citing unreviewed work caused the wheels to fall off of science? Because citing appropriate work in the proper context is part of our job. There are good preprints and bad preprints, good reports and bad reports, good data and bad data, good software and bad software, and good papers and bad papers. As Belinda Phipson, Casey Green, Dave Harris and Sebastian Raschka point out it is up to us as the people citing research to make professional judgments about what is good science and should be cited. Casey’s take captures my thoughts on this exactly:
So yes, you should cite preprints and other unreviewed things that are important for your work. That’s called proper attribution. It has worked in ecology and other fields for decades. It will continue to work because we are scientists and evaluating the science we cite is part of our jobs. You can even cite this blog post if you want to.
Thanks to everyone both linked here and not for the spirited discussion. Sorry I wasn’t there, but Sleeping Queens is a pretty awesome game.
UPDATE: For those of you new to this discussion, it’s been going on for a long time even in biology. Here is Graham Coop’s excellent post from nearly 4 years ago.
UPDATE: Discussion of why it’s important to put preprint citations in the reference list
The Weecology lab group run by Ethan White and Morgan Ernest at the University of Florida is seeking a Data Analyst to work collaboratively with faculty, graduate students, and postdocs to understand and model ecological systems. We’re looking for someone who enjoys tidying, managing, manipulating, visualizing, and analyzing data to help support scientific discovery.
The position will include:
- Organizing, analyzing, and visualizing large amounts of ecological data, including spatial and remotely sensed data. Modifying existing analytical approaches and data protocols as needed.
- Planning and executing the analysis of data related to newly forming questions from the group. Assisting in the statistical analysis of ecological data, as determined by the needs of the research group.
- Providing assistance and guidance to members of the research group on existing research projects. Working collaboratively with undergraduates, graduate students and postdocs in the group and from related projects.
- Learning new analytical tools and software as needed.
This is a staff position in the group and will be focused on data management and analysis. All members of this collaborative group are considered equal partners in the scientific process and this position will be actively involved in collaborations. Weecology believes in the importance of open science, so most work done as part of this position will involve writing open source code, use of open source software, and production and use of open data.
Weecology is a partnership between the White Lab, which studies ecology using quantitative and computational approaches, and the Ernest Lab, which tends to be more field and community ecology oriented. The Weecology group supports and encourages members interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, as faculty at teaching-focused colleges, and as postdocs and faculty at research universities. We are also committed to supporting and training a diverse scientific workforce. Current and former group members encompass a variety of racial and ethnic backgrounds from the U.S. and other countries, members of the LGBTQ community, military veterans, people with chronic illnesses, and first-generation college students. More information about the Weecology group and respective labs is available on our website. You can also check us out on Twitter (@skmorgane, @ethanwhite, @weecology), GitHub, and our blog Jabberwocky Ecology.
The ideal candidate will have:
- Experience working with data in R or Python, some exposure to version control (preferably Git and GitHub), and potentially some background with database management systems (e.g., PostgreSQL, SQLite, MySQL) and spatial data.
- Research experience in ecology
- Interest in open approaches to science
- Experience collecting or working with ecological data
That said, don’t let the absence of any of these stop you from applying. If this sounds like a job you’d like to have please go ahead and put in an application.
We currently have funding for this position for 2.5 years. Minimum salary is $40,000/year (which goes a pretty long way in Gainesville), but there is significant flexibility in this number for highly qualified candidates. We are open to the possibility of someone working remotely. The position will remain open until filled, with initial review of applications beginning on May 5th. If you’re interested in applying you can do so through the official UF position page. If you have any questions or just want to let us know that you’re applying you can email Weecology’s project manager Glenda Yenni at email@example.com.
We are very excited to announce a major new release of the Data Retriever, our software for making it quick and easy to get clean, ready-to-analyze versions of publicly available data.
The Data Retriever automates the downloading, cleaning, and installing of ecological and environmental data into your choice of databases and flat file formats. Instead of hours tracking down the data on the web, downloading it, trying to import it, running into issues (e.g., non-standard nulls, problematic column names, encoding issues), fixing one problem, and then encountering the next, all you need to do is run a single command from the command line:
$ retriever install csv iris
$ retriever install sqlite breed-bird-survey -f bbs.sqlite
or from R:
>>> rdataretriever::install('postgres', 'wine-quality')
>>> portal_data <- rdataretriever::fetch('portal')
The Data Retriever uses information in Frictionless Data datapackage.json files to automatically handle all of the complexities of “simple” data for you. For more complicated datasets, with dozens of components or major data structure issues, the Retriever uses Python scripts as plugins to handle the major data cleaning work and then automatically handles the rest.
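To make the idea of metadata-driven cleaning concrete, here is a minimal Python sketch. It is not the Retriever's actual code. The descriptor only loosely follows the Frictionless Data Table Schema layout (the fields and missingValues keys come from that specification), and the dataset name, file name, column names, and the clean_table helper are all hypothetical, made up for illustration.

```python
import csv
import io

# Hypothetical descriptor, loosely following the Frictionless Data
# Table Schema layout. The dataset, file, and column names are invented.
DESCRIPTOR = {
    "name": "example-survey",
    "resources": [
        {
            "name": "counts",
            "path": "counts.csv",
            "schema": {
                "fields": [
                    {"name": "site_id", "type": "string"},
                    {"name": "species", "type": "string"},
                    {"name": "count", "type": "integer"},
                ],
                # Values that should be treated as nulls in this table
                "missingValues": ["", "NA", "-999"],
            },
        }
    ],
}

# Map schema type names to Python casting functions
CASTS = {"string": str, "integer": int, "number": float}


def clean_table(raw_text, schema):
    """Rename columns to the schema's names, cast types, and map nulls to None."""
    fields = schema["fields"]
    missing = set(schema.get("missingValues", [""]))
    reader = csv.reader(io.StringIO(raw_text))
    next(reader)  # skip the original (possibly problematic) header row
    rows = []
    for raw_row in reader:
        row = {}
        for field, value in zip(fields, raw_row):
            value = value.strip()
            if value in missing:
                row[field["name"]] = None
            else:
                row[field["name"]] = CASTS[field["type"]](value)
        rows.append(row)
    return rows


# Raw data with awkward column names and two different null markers
RAW = "Site ID,Species Name,Count (n)\nA1,DM,12\nA2,PP,-999\nA3,NA,3\n"

cleaned = clean_table(RAW, DESCRIPTOR["resources"][0]["schema"])
```

The point of the sketch is that the cleaning rules live in the metadata rather than in per-dataset code, so adding a simple dataset means writing a descriptor instead of a script.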
Expanded focus and name change
For those of you familiar with the EcoData Retriever, this is the same software with a new name. Challenges with the data end of the analysis pipeline occur across disciplines and our tools work just as well for non-ecological data, so we’ve started adding non-ecological data and changed our name to reflect that. We’d love to hear from anyone interested in leading a push to add data from another discipline or just interested in adding a single favorite dataset.
As part of this we’ve changed the name of the R package from
The 2.0 release includes a number of major changes including:
- Python 3 support (a single code base runs on both Python 2 and 3)
- Adoption of the Frictionless Data datapackage.json standard (replacing our old YAML-like metadata system), including a command line interface for creating and editing datapackage.json files
- Add json and xml as available output formats
- Major expansion of the documentation and hosting of the documentation at Read the Docs
- Remove the graphical user interface (to allow us to focus that development time on wrappers for other languages)
- Lots of work under the hood and major improvements in testing
- Broaden scope to include non-ecological data
We are also in the process of releasing version 1.0 of the R package. This version adds the new features in the Data Retriever and also includes major stability improvements, in particular in RStudio and on Windows.
We also have a brand new website.
Upgrading to the new version (UPDATED)
To ensure the smoothest upgrade to the new version we recommend:
- Run retriever reset scripts from the command line
- Uninstall the old version of the EcoData Retriever
- Install the new version
- Run retriever update from the command line
Henry Senyondo is the lead developer for the Data Retriever and has done an amazing job over the past year developing new features and shoring up the fundamentals for the software. He led the work on 2.0 start to finish.
Akash Goel was a Google Summer of Code student with the project last summer and was responsible for the majority of the work adding Python 3 support and switching the project over to the
Dan McGlinn, the creator of the R package, has continued his excellent leadership of the development of this package. Shawn Taylor, a new contributor, was instrumental in solving the stability issues on Windows/RStudio.
In addition to these core folks our growing group of contributors to both projects have been invaluable for adding new functionality, fixing bugs, and testing new changes. We are super excited to have contributions from 30 different people and will keep working hard to make sure that everyone feels welcome and supported in contributing to the project.
The level of work done to get these releases out the door was only possible due to generous support of the Gordon and Betty Moore Foundation’s Data Driven Discovery Initiative. This support allowed my group to employ Henry as a full time software engineer to work on these and other projects. This kind of active support for the development and maintenance of research oriented software makes sustainable software development at universities possible.
We are very excited to announce the newest release of the EcoData Retriever, our software for automating the downloading, cleaning, and installing of ecological and environmental data. Instead of hours or days trying to get complicated datasets like the Breeding Bird Survey ready for analysis, the Retriever lets you simply click a button or run a single command from R or the command line, and your computer does the rest.
It’s been over a year since the last retriever release and there are lots of new features and improvements to be excited about.
- We’ve added 21 new datasets, including major ecological and environmental datasets like eBird, Vertnet, the Global Wood Density Database, and the PRISM climate data.
- To support all of these datasets we’ve added support for additional data types, including larger-than-memory archive files, and we’ve also improved the ability to control where downloaded files are stored and how they are clustered together.
- We’ve significantly improved documentation and now have a new automatically built documentation site at Read The Docs.
- We’ve also made a lot of under the hood improvements.
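For readers curious what handling an archive that is too large to fit in memory can look like, here is a small Python sketch of one common approach: copying a file out of a zip archive in fixed-size chunks rather than reading it all at once. This is an illustration only and not necessarily how the Retriever does it; the extract_member_streaming helper and the observations.csv file are hypothetical.

```python
import io
import shutil
import zipfile


def extract_member_streaming(archive, member, destination, chunk_size=1024 * 1024):
    """Copy one file out of a zip archive in fixed-size chunks."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as source:
            # copyfileobj reads and writes chunk_size bytes at a time,
            # so the whole file is never held in memory at once
            shutil.copyfileobj(source, destination, chunk_size)


# Build a small in-memory archive so the example is self-contained
archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as zf:
    zf.writestr("observations.csv", "species,count\nDM,12\nPP,7\n")
archive_bytes.seek(0)

# Stream the member out using a deliberately tiny chunk size
output = io.BytesIO()
extract_member_streaming(archive_bytes, "observations.csv", output, chunk_size=8)
extracted_text = output.getvalue().decode("utf-8")
```

In real use the destination would be a file opened in binary write mode, and the chunk size would be left at something much larger than 8 bytes.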
This is also the first release that has been overseen by Weecology’s new software engineer, Henry Senyondo. We’re excited to have Henry on the team, and now that he’s around development of both the EcoData Retriever and other lab software projects will be happening more quickly.
A big thanks to the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative for funding this development through Grant GBMF4563 and to the National Science Foundation for funding as part of a CAREER award to Ethan White.
data <- ecoretriever::fetch("BBS")
A couple of months ago Micah J. Marty and I had a Twitter conversation and subsequent email exchange about how citations work with preprints. I asked Micah if I could share our email discussion since I thought it would be useful to others and he kindly said yes. What follows are Micah’s questions followed by my responses.
Right now, I am finishing up a multi-chapter Master’s thesis and I plan to publish a few papers from my work. I may want to submit a preprint of one manuscript but before I propose this avenue to my advisor, I want to understand it fully myself. And I have remaining questions about the syntax of citing works when preprints come into play. What happens to a citation of a preprint after the manuscript is later published in a peer reviewed venue?
At the level of the journal nothing happens. So, if you cite a preprint in a published ms, and that preprint is later published as a paper, then the citation is still to the preprint. However, some of the services indexing citations recognize the relationship between the preprint and the paper and aggregate the citations. Specifically, Google Scholar treats the preprint and the published paper as the same for citation analysis purposes. See the citation record for our paper on Best practices for scientific computing which has been cited 49 times, but the vast majority of those are citations to the preprints.
Here’s an example with names we can play with: Manuscript 1 (M1) may require some extra analysis, but it presents some important unexpected results that I would like to get out on the table as soon as possible. M1 is submitted to PeerJ Preprints and accepted (i.e., published online as a preprint with a DOI). M2 is submitted to Marine Ecology Progress Series (MEPS) for peer review, and M2 cites the PeerJ Preprint M1.
Just a point related to vocabulary, I wouldn’t typically think of the preprint as being “accepted”. Any checking prior to posting is just a quick glance to make sure that it isn’t embarrassingly bad, so as long as it’s reasonably written and doesn’t have a title like “E is not equal to mc squared” it will be posted almost immediately (within 48 hours on most preprint servers).
1) Are preprints considered “grey literature”? That is, is it illegitimate for M2 to cite a work that has not been peer reviewed?
Yes, in the sense that they haven’t been formally peer reviewed prior to posting they are similar to “grey literature”. Whether or not they can be cited depends on the journal. Some journals are happy to allow citing of preprints. For example, this recent paper in TREE cites a preprint of ours on arXiv. Their paper was published before ours was accepted, so if it wasn’t for the preprint it couldn’t have been cited.
2) Is there a problem if M1 is eventually published in a peer reviewed journal but the published article of M2 cites only the PeerJ Preprint of M1?
I would say no for two reasons. First, assuming that M2 is published before M1 then the choice is between having a citation to something that people can read, science can benefit from, and that can potentially be indexed (giving you citation credit) vs. a citation to “Marty et al. unpublished data”, which basically does nothing. Second, all preprint servers provide a mechanism for linking to the final version, so if someone finds the preprint via a citation in M2 then that link will point them in the direction of the final version that they can then read/cite/etc.
In short, I think as long as you aren’t planning on submitting to a behind the times journal that doesn’t allow the submission of papers that have been posted as preprints (and the list of journals with this policy is shrinking rapidly) then there is no downside to posting preprints. In the best case scenario it can lead to more people reading your research and citing it. The worst case scenario is exactly the same as if you didn’t post a preprint.
We are very excited to announce the newest release of our EcoData Retriever software and the first release of a supporting R package, ecoretriever. If you’re not familiar with the EcoData Retriever you can read more here.
The biggest improvement to the Retriever in this set of releases is the ability to run it directly from R. Dan McGlinn did a great job leading the development of this package and we got a ton of fantastic help from the folks at rOpenSci (most notably Scott Chamberlain, Gavin Simpson, and Karthik Ram). Now, once you install the main EcoData Retriever, you can run it from inside R by doing things like:
install.packages('ecoretriever')
library(ecoretriever)

# List the datasets available via the Retriever
ecoretriever::datasets()

# Install the Gentry dataset into csv files in your working directory
ecoretriever::install('Gentry', 'csv')

# Download the raw Gentry dataset files, without any processing,
# to the subdirectory named data
ecoretriever::download('Gentry', './data/')

# Install and load a dataset as a list
Gentry = ecoretriever::fetch('Gentry')
names(Gentry)
head(Gentry$counts)
The other big advance in this release is the ability to have the Retriever directly download files instead of processing them. This allows us to support data that doesn’t come in standard tabular forms. So, we can now include things like environmental data in GIS formats and phylogenetic data such as supertrees. We’ve used this new capability to allow the automatic downloading of the Bioclim data, one of the most widely used climate datasets in ecology, and the supertree for mammals from Fritz et al. 2009.
As announced by Noam Ross on Twitter (and confirmed by the Editor in Chief of Ecology Letters), Ecology Letters will now allow the submission of manuscripts that have been posted as preprints. Details will be published in an editorial in Ecology Letters. I want to say a heartfelt thanks to Marcel Holyoak and the entire Ecology Letters editorial board for listening to the ecological community and modifying their policies. Science is working a little better today than it was yesterday thanks to their efforts.
For those of you who are new to the concept of preprints, they are manuscripts that have not yet been published in peer reviewed journals and that are posted to websites like arXiv, PeerJ, and bioRxiv. This process allows for more rapid communication of scientific results and improved quality of published papers through more expansive pre-publication peer review. If you’d like to read more check out our paper on The Case for Open Preprints in Biology.
The fact that Ecology Letters now allows preprints is a big deal for ecology because they were the last of the major ecology journals to make the transition. The ESA journals began allowing preprints just over two years ago and the BES journals made the switch about 9 months ago. In addition, Science, Nature, PNAS, PLOS Biology, and a number of other ecology journals (e.g., Biotropica) all support preprints. This means that all of the top ecology journals, and all of the top general science journals that most ecologists publish in, allow the posting of preprints. As such, there is no longer a reason to avoid posting preprints based on the possibility of not being able to publish in a preferred journal. This can potentially shave months to years off of the time between discovery and initial communication of results in ecology.
It also means that other ecology journals that still do not allow the posting of preprints are under significant pressure to change their policies. With all of the big journals allowing preprints they have no reasonable excuse for not modernizing their policies, and they risk losing out on papers that are initially submitted to higher profile journals and are posted as preprints.
It’s a good day for science. Celebrate by posting your next manuscript as a preprint.