Jabberwocky Ecology

Weecology at ESA

We have a modest-sized group of current folks at ESA this week presenting on all the cool things they’ve been doing. We’re also around and always happy to find time to grab a coffee or just a few minutes to chat science.

Our schedule for the week is:

Monday

Get a double dose of rapid change in ecological communities from the Portal Project with Morgan Ernest and Erica Christensen.

02:50 PM – 03:10 PM in C120-121. Erica Christensen (w/Dave Harris & Morgan Ernest). Novel approach for the analysis of community dynamics: Separating rapid reorganizations from gradual trends.

03:20 PM – 03:40 PM in C120-121. Morgan Ernest (w/Erica Christensen). Do existing communities slow community reorganization in response to changes in assembly processes?

Tuesday

Find out what we can learn about how natural systems may change in response to climate from looking at large datasets with Ethan White and Kristina Riemer.

01:50 PM – 02:10 PM in D139. Kristina Riemer (w/Rob Guralnick & Ethan White). No general relationship between mass and temperature in endotherm species.

02:30 PM – 02:50 PM in Portland Blrm 256. Ethan White (w/Dave Harris & Shawn Taylor). Data-intensive approaches to forecasting biodiversity.

Thursday

Check out a new project with a new and exciting research tool for us (metabarcoding) at the poster session.

04:30 PM – 06:30 PM in the Exhibit Hall. Ellen Bledsoe (w/Sam Wisely & Morgan Ernest). DNA metabarcoding of fecal samples provides insight into desert rodent diet partitioning.

Collaborations

There are also plenty of weecology collaborations being presented this week:

We’re really looking forward to catching up with old friends and meeting new people this week.

The Portal Project 40th Anniversary

The Portal Project turns 40 this year! In celebration, we will be regularly posting about the history of the site, new things going on, natural history of the desert, and other fun things over at the Portal Blog.

The Portal Project

Funded by the National Science Foundation to study the importance of competition and granivory in desert ecosystems, the Portal Project first started collecting data in the summer of 1977. The initial grant was just for 5 years, yet 40 years later the site is still collecting data on plants, rodents, and weather.

To our friends who study paleoecology, 40 years is an eyeblink, but in the span of a human life, 40 years is a long time. As you might expect, much has changed on the project. For one thing, after 40 years, the team running the site has changed. The original team of scientists, Jim Brown, Dinah Davidson, and Jim Reichman, have all retired from the daily challenges of training students and writing grants, though some are still doing science. In their place, Tom Valone and I do our best to keep things running, studying the mysteries of the…

Is it OK to cite preprints? Yes, yes it is.

Should you cite preprints in your papers and should journals allow this? This is a topic that gets debated periodically. The most recent round of Twitter debate started last week when Martin Hunt pointed out that the journal Nucleic Acids Research wouldn’t allow him to cite them. A couple of days later I suggested that journals that don’t allow citing preprints are putting their authors at risk by forcing them not to cite relevant work. Roughly forty games of Sleeping Queens later (my kid is really into Sleeping Queens) I reopened Twitter and found a roiling debate over whether citing preprints was appropriate at all.

The basic argument against citing preprints is that they aren’t peer reviewed, and that this could lead to the citation of bad work and the potential decay of science.

There are three reasons I disagree with this argument:

  1. We already cite lots of non-peer reviewed things in ecology
  2. Lots of fields already do this and they are doing just fine
  3. Responsibility for the citation lies with the citer

We already cite non-peer reviewed things in ecology

As Auriel Fournier, Stephen Heard, Michael Hoffman, Terry McGlynn and ATMoody pointed out, we already cite lots of things that aren’t peer reviewed, including government agency reports, white papers, and other “grey literature”.

We also cite lots of other really important non-peer reviewed things like data and software. We’ve been doing this for decades. Ecology hasn’t become polluted with pseudoscience. It will all be OK.

Lots of other fields already do this

One of the things I find amusing/exhausting about biologists debating preprints is ignorance of their history and use in other fields. It’s a bit like debating the name of an actor for two hours when you could easily look it up on Google.

In this particular case (as Eric Pedersen pointed out) we know that citation of preprints isn’t going to cause problems for the field because it hasn’t caused issues in other fields and has almost invariably become standard practice in fields that use preprints. Unless you think Physics and Math are having real issues, it’s difficult to argue that this is a meaningful problem. Just ask a physicist.

You are responsible for your citations

Why hasn’t citing unreviewed work caused the wheels to fall off of science? Because citing appropriate work in the proper context is part of our job. There are good preprints and bad preprints, good reports and bad reports, good data and bad data, good software and bad software, and good papers and bad papers. As Belinda Phipson, Casey Green, Dave Harris and Sebastian Raschka point out, it is up to us as the people citing research to make professional judgments about what is good science and should be cited. Casey’s take captures my thoughts on this exactly:

TLDR

So yes, you should cite preprints and other unreviewed things that are important for your work. That’s called proper attribution. It has worked in ecology and other fields for decades. It will continue to work because we are scientists and evaluating the science we cite is part of our jobs. You can even cite this blog post if you want to.

Thanks to everyone both linked here and not for the spirited discussion. Sorry I wasn’t there, but Sleeping Queens is a pretty awesome game.

UPDATE: For those of you new to this discussion, it’s been going on for a long time even in biology. Here is Graham Coop’s excellent post from nearly 4 years ago.

UPDATE: Discussion of why it’s important to put preprint citations in the reference list

Data Analyst position in ecology research group

The Weecology lab group run by Ethan White and Morgan Ernest at the University of Florida is seeking a Data Analyst to work collaboratively with faculty, graduate students, and postdocs to understand and model ecological systems. We’re looking for someone who enjoys tidying, managing, manipulating, visualizing, and analyzing data to help support scientific discovery.

The position will include:

  • Organizing, analyzing, and visualizing large amounts of ecological data, including spatial and remotely sensed data. Modifying existing analytical approaches and data protocols as needed.
  • Planning and executing the analysis of data related to newly forming questions from the group. Assisting in the statistical analysis of ecological data, as determined by the needs of the research group.
  • Providing assistance and guidance to members of the research group on existing research projects. Working collaboratively with undergraduates, graduate students and postdocs in the group and from related projects.
  • Learning new analytical tools and software as needed.

This is a staff position in the group and will be focused on data management and analysis. All members of this collaborative group are considered equal partners in the scientific process and this position will be actively involved in collaborations. Weecology believes in the importance of open science, so most work done as part of this position will involve writing open source code, use of open source software, and production and use of open data.

Weecology is a partnership between the White Lab, which studies ecology using quantitative and computational approaches, and the Ernest Lab, which tends to be more field and community ecology oriented. The Weecology group supports and encourages members interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, as faculty at teaching-focused colleges, and as postdocs and faculty at research universities. We are also committed to supporting and training a diverse scientific workforce. Current and former group members encompass a variety of racial and ethnic backgrounds from the U.S. and other countries, members of the LGBTQ community, military veterans, people with chronic illnesses, and first-generation college students. More information about the Weecology group and respective labs is available on our website. You can also check us out on Twitter (@skmorgane, @ethanwhite, @weecology), GitHub, and our blog Jabberwocky Ecology.

The ideal candidate will have:

  • Experience working with data in R or Python, some exposure to version control (preferably Git and GitHub), and potentially some background with database management systems (e.g., PostgreSQL, SQLite, MySQL) and spatial data.
  • Research experience in ecology
  • Interest in open approaches to science
  • Experience collecting or working with ecological data

That said, don’t let the absence of any of these stop you from applying. If this sounds like a job you’d like to have please go ahead and put in an application.

We currently have funding for this position for 2.5 years. Minimum salary is $40,000/year (which goes a pretty long way in Gainesville), but there is significant flexibility in this number for highly qualified candidates. We are open to the possibility of someone working remotely. The position will remain open until filled, with initial review of applications beginning on May 5th. If you’re interested in applying you can do so through the official UF position page. If you have any questions or just want to let us know that you’re applying you can email Weecology’s project manager Glenda Yenni at glenda@weecology.org.

Postdoctoral research position in the Temporal Dynamics of Communities

The Weecology lab group run by Morgan Ernest and Ethan White at the University of Florida is seeking a post-doctoral researcher to study changes in ecological communities through time. This position will primarily involve broad-scale comparative analyses across communities using large time-series datasets and/or in-depth analyses of our own long-term dataset (the Portal Project). Experience with any of the following is useful, but not required: long-term data, macroecology, paleoecology, quantitative/theoretical ecology, and programming/data analysis in R or Python. The successful applicant will be expected to collaborate on lab projects on community dynamics and develop their own research projects in this area according to their interests.

Weecology is a partnership between the Ernest Lab, which tends to be more field and community ecology oriented and the White Lab, which tends to be more quantitative and computationally oriented. The Weecology group supports and encourages students interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, as faculty at teaching-focused colleges, and as postdocs and faculty at research universities. We are also committed to supporting and training a diverse scientific workforce. Current and former group members encompass a variety of racial and ethnic backgrounds from the U.S. and other countries, members of the LGBTQ community, military veterans, people with chronic illnesses, and first-generation college students. More information about the Weecology group and respective labs is available on our website. You can also check us out on Twitter (@skmorgane, @ethanwhite, @weecology), GitHub, and our blog Jabberwocky Ecology.

This 2-year postdoc has a flexible start date, but can start as early as June 1st 2017. Interested students should contact Dr. Morgan Ernest (skmorgane@ufl.edu) with their CV including a list of three references, a cover letter detailing their research interests/experiences, and one or more research samples (a PDF or link to a scientific product such as a published paper, preprint, software, data analysis code, etc). The position will remain open until filled, with initial review of applications beginning on April 24th.

Data Retriever 2.0: We handle the data so you can focus on the analysis

We are very excited to announce a major new release of the Data Retriever, our software for making it quick and easy to get clean, ready to analyze versions of publicly available data.

The Data Retriever automates the downloading, cleaning, and installing of ecological and environmental data into your choice of databases and flat file formats. Instead of spending hours tracking down the data on the web, downloading it, trying to import it, running into issues (e.g., non-standard nulls, problematic column names, encoding issues), fixing one problem, and then encountering the next, all you need to do is run a single command from the command line:

$ retriever install csv iris
$ retriever install sqlite breed-bird-survey -f bbs.sqlite

or from R:

> rdataretriever::install('wine-quality', 'postgres')
> portal_data <- rdataretriever::fetch('portal')

The Data Retriever uses information in Frictionless Data datapackage.json files to automatically handle all of the complexities of “simple” data for you. For more complicated datasets, with dozens of components or major data structure issues, the Retriever uses Python scripts as plugins to handle the major data cleaning work and then automatically handles the rest.

To find out more about the Data Retriever, check out the websites, the full documentation, and the GitHub repositories for both the Data Retriever and the R Data Retriever package.
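
To make the kind of cleanup described above a little more concrete, here is a minimal Python sketch of two of the issues mentioned earlier: non-standard nulls and problematic column names. This is an illustration only and is not the Data Retriever’s actual code. The function names clean_column_name and clean_rows, the list of null markers, and the sample data are all assumptions made up for this example.

```python
import csv
import io
import re

# Hypothetical set of non-standard null markers; real datasets vary widely.
NULL_MARKERS = {"", "NA", "N/A", "-999", "null"}


def clean_column_name(name):
    """Lower-case a header and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    # Column names that start with a digit are awkward in most databases.
    if cleaned and cleaned[0].isdigit():
        cleaned = "col_" + cleaned
    return cleaned


def clean_rows(text):
    """Parse CSV text, normalise the headers, and map null markers to None."""
    reader = csv.reader(io.StringIO(text))
    header = [clean_column_name(h) for h in next(reader)]
    rows = []
    for raw in reader:
        values = [None if v.strip() in NULL_MARKERS else v.strip() for v in raw]
        rows.append(dict(zip(header, values)))
    return header, rows


if __name__ == "__main__":
    # Made-up sample data, not taken from any real dataset.
    sample = (
        "Species Name,Sepal Length (cm),2nd count\n"
        "setosa,5.1,-999\n"
        "virginica,N/A,3\n"
    )
    header, rows = clean_rows(sample)
    print(header)
    for row in rows:
        print(row)
```

Even this toy version has to make dataset-specific decisions (which values count as null, how to rename columns), which is the sort of per-dataset detail that is tedious to rediscover by hand for every new data source.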

Expanded focus and name change

For those of you familiar with the EcoData Retriever, this is the same software with a new name. Challenges with the data end of the analysis pipeline occur across disciplines and our tools work just as well for non-ecological data, so we’ve started adding non-ecological data and changed our name to reflect that. We’d love to hear from anyone interested in leading a push to add data from another discipline or just interested in adding a single favorite dataset.

As part of this we’ve changed the name of the R package from ecoretriever to rdataretriever.

Major changes

The 2.0 release includes a number of major changes including:

  • Python 3 support (a single code base runs on both Python 2 and 3)
  • Adoption of the Frictionless Data datapackage.json standard (replacing our old YAML-like metadata system), including a command line interface for creating and editing datapackage.json files
  • Add json and xml as available output formats
  • Major expansion of the documentation and hosting of the documentation at Read the Docs
  • Remove the graphical user interface (to allow us to focus that development time on wrappers for other languages)
  • Lots of work under the hood and major improvements in testing
  • Broaden scope to include non-ecological data

We are also in the process of releasing version 1.0 of the R package. This version adds the new features in the Data Retriever and also includes major stability improvements, in particular in RStudio and on Windows.

We also have a brand new website.

Upgrading to the new version (UPDATED)

To ensure the smoothest upgrade to the new version we recommend:

  1. Run retriever reset scripts from the command line
  2. Uninstall the old version of the EcoData Retriever
  3. Install the new version
  4. Run retriever update from the command line

Acknowledgments

Henry Senyondo is the lead developer for the Data Retriever and has done an amazing job over the past year developing new features and shoring up the fundamentals for the software. He led the work on 2.0 start to finish.

Akash Goel was a Google Summer of Code student with the project last summer and was responsible for the majority of the work adding Python 3 support and switching the project over to the datapackage.json standard.

Dan McGlinn, the creator of the R package, has continued his excellent leadership of the development of this package. Shawn Taylor, a new contributor, was instrumental in solving the stability issues on Windows/RStudio.

In addition to these core folks, our growing group of contributors to both projects has been invaluable for adding new functionality, fixing bugs, and testing new changes. We are super excited to have contributions from 30 different people and will keep working hard to make sure that everyone feels welcome and supported in contributing to the project.

The level of work done to get these releases out the door was only possible due to the generous support of the Gordon and Betty Moore Foundation’s Data Driven Discovery Initiative. This support allowed my group to employ Henry as a full time software engineer to work on these and other projects. This kind of active support for the development and maintenance of research oriented software makes sustainable software development at universities possible.

The potential for collaborative open lesson development for college coursework

Last week Zack Brym and I formally announced a semester-long Data Carpentry course that we’ve been building over the last year. One of the things I’m most excited about in this effort is our attempt to support collaborative lesson development for university/college coursework.

I’ve experienced firsthand the potential for this sort of collaborative lesson development through the development of workshop lessons in Software Carpentry and Data Carpentry. Many of the workshop lessons developed by these two organizations now have 100+ contributors. As far as I’m aware, Software Carpentry was the first demonstration that large-scale open collaboration on lessons could work (but I’d love to hear of earlier examples if folks are aware of them) and it has resulted in what is widely regarded as really high quality lesson material. Having seen this work so effectively for workshops, I’m interested in seeing how well it can work for full length courses.

Most college and university courses that I’m aware of start in one of three ways: 1) someone sits down and develops a course completely from scratch; 2) the course directly follows a text book; or 3) a new professor inherits a course from the person who taught it previously and adapts it.

Developing a course from scratch, even one following a text book fairly closely, is a huge time commitment. In contrast, with collaboratively developed courses new faculty, or faculty teaching new courses, wouldn’t need to start from scratch. They would be able to pick up an existing course to adapt and improve. I can’t even begin to describe how much easier this would have made my first few years as a faculty member. More generally, if we are teaching similar courses across dozens or hundreds of universities, it is much more efficient to share the effort of building and improving those courses than to have each person who teaches them do so independently.

In addition to the time and energy, there are often a lot of things that don’t work well the first time you teach a course and it typically takes a few rounds of teaching it to figure out what works best. One of the challenges of developing lessons in isolation is that you only teach a class every 1 or 2 years. This makes it hard and slow to figure out what needs work. In contrast, a collaboratively developed course might be taught dozens or hundreds of times each year, allowing the course to be improved much more rapidly through large scale sampling and discussion of what works and what doesn’t. In addition to having more information, the fact that faculty are spending less time developing courses from scratch should leave them with more time for improving the materials. In combination this results in the potential for higher quality courses across institutions.

By involving large numbers of lesson developers, collaborative development also has the potential to help make courses more accurate, more up-to-date, and more approachable by novices. More lesson developers means a greater chance of having an expert on any particular topic involved, thus making the material more accurate and reducing the amount of bad practice/knowledge that gets taught. New faculty with more recent training on the development team can help keep both the material and the pedagogical practices up-to-date (this is hard when the same person teaches the same course for 20 years). More lesson developers also increases the likelihood that someone who isn’t an expert in any given piece of material is also involved, which should help make sure that the lesson avoids issues with expert blindness, thus making the material more accessible to students.

Collaborative college/university lesson development will not be without challenges. Collaborative lesson development in the style of Software and Data Carpentry requires proficiency with computational approaches not familiar to many academics. The necessary skills include things like version control, developing materials in markdown, and working with static site generators like Jekyll. This means this approach is currently most accessible for those with some computational training and may initially work best for computing focused courses. In addition, organizing open collaborations takes time and energy, as does collectively deciding on how to design and update classes. Universities and colleges are not typically good at valuing time invested in non-traditional efforts and that would need to change to help support those managing development of courses with large numbers of faculty involved. More substantial may be the fact that faculty are not used to collaborating with other people on course development and are therefore not used to compromising and negotiating what should go into a course. This can be compensated for to some degree by making courses easy to modify and customize, as we’ve tried to do with the Data Carpentry Semester course, but ultimately there will still need to be a shift from prioritizing the personal desires of the faculty member to the best interests of the course more broadly. This approach will likely work best where there are a number of places that all want to teach the same general material.

Is it time?

When I built my first version of Programming for Biologists back in 2010 I was really excited about the potential for collaborative open course development. I built the course using Drupal, emailed a bunch of my friends who were teaching similar courses and said “Hey, we should work together on this stuff”, and stuck some welcoming language on the homepage. Nothing happened. A few years later I was on sabbatical at the University of North Carolina and got the opportunity to talk a fair bit with Elliot Hauser who was part of a team trying to encourage this through a start-up called Coursefork. I was somewhat skeptical that this approach would work broadly at the time, but I thought it was really awesome that they were trying. They ended up pivoting to focus on helping computing education through a somewhat different route and became trinket. A couple of years later I converted my course to Jekyll on GitHub and told a lot of people about it. There was much excitement. Still nothing happened.

So why might this work now? I think there are three things that increase the possibility of this becoming a bigger deal going forward. First, open source software development is becoming more frequent in academia. It still isn’t rewarded anywhere close to sufficiently, but the ethos of using and contributing to collaboratively developed tools is growing. Second, the technical tools that make this kind of collaboration easier are becoming more widely used and easier to learn through training efforts like Software Carpentry. Third, more and more people are actively developing university courses using these tools and making them available under open source licenses. Two of my favorites are Jenny Bryan’s Stat 545 and Karl Broman’s Tools for Reproducible Research. Our development of the Data Carpentry semester course has already benefited from using openly available materials like these and feedback from members of the computational teaching community.

I guess we’ll see what happens next.

This post benefited from a number of comments and suggestions by Zack Brym, who has also played a central and absolutely essential role in the development of the Data Carpentry semester-long course. The post also benefited from several conversations with Tracy Teal, the Executive Director of Data Carpentry, about the potential value of these approaches for college courses.

Fork our course: A semester-long Data Carpentry course for biologists

This post is co-authored by Zack Brym and Ethan White.

Over the last year and a half we have been actively developing a semester-long Data Carpentry course designed to be easily customized and integrated into existing graduate and undergraduate curricula.

Data Carpentry for Biologists contains course materials for teaching scientists how to work more effectively with data. The course provides introductions to data management and relational databases, data manipulation and analysis, and data visualization. It covers the same general types of material as a two-day Data Carpentry workshop, but expands the materials and opportunities for practice into a full-length university course. The teaching material uses R and SQLite, with some corresponding materials for Python as well. To help students understand the direct applications to their interests, the examples and exercises focus on biological questions and working with real data. The course emphasizes using best practices to produce reusable and reproducible data analysis.

Active-learning Teaching Materials

Learning computing requires active practice by working through programming problems. Just diving in to computing is challenging for most scientists, so the course instruction combines short live-coding introductions to concepts with related exercises that the students work on immediately afterward. Additional exercises are assigned later for practice. This follows the “I do”, “We do”, “You do” approach to teaching, which leverages the benefits of active-learning and flipped classrooms without leaving students who are less comfortable with the material feeling lost. The bulk of class time is spent working on assigned exercises with the instructor moving around the room helping guide students through things they don’t understand and engaging with students who are thinking about advanced applications of what they’ve learned.

This approach is the result of lots of reading about effective teaching methods and Ethan’s experience teaching this and related courses over the last six years at Utah State University and the University of Florida. It seems to work well for both students that get the material easily and those that find it more challenging. We’ve also tried to make these materials as useful as possible for self-guided students.

Open course development

Software Carpentry and Data Carpentry have shown how powerful collaborative lesson development can be and we’re interested in bringing that to the university classroom. We have designed the course materials to be modular and easy to modify, and the course website easy to clone and set up. All of the teaching materials and associated website files are openly available at the Data Carpentry for Biologists repository on GitHub under CC-BY and MIT licenses. The course materials are all written in Markdown and everything runs on Jekyll through GitHub Pages. Making your own version of the course should take less than an hour. We’ve developed documentation for how to create your own version of the course and how to contribute to development. Exercises and assignments are modular and changing exercises and assignments simply involves reordering items in a list. Adding a new exercise involves creating a new Markdown file and then adding its title to the list of exercises for an assignment.

Get Involved

If you teach, or want to teach, a course like this, we’d love to get you involved. Here are some useful links for getting started.

–   I want to teach the course.
–   I have some feedback.
–   I want to contribute to the project.

We want to be sure getting involved is as easy as possible. We’ve worked hard to provide documentation and help resources for students and instructors. Students can find all they need to know at our student start guide. Instructors have access to course content and site design documentation.

If you’re having trouble finding something or getting something to work, or simply have some feedback about the course, please open a new issue at GitHub or send us an email.

Acknowledgements

Development of this course was generously supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.

Walking the beat

The data from counting plants and rats in the desert are fun to collect, but also hard won. Another Portal Project blog post is up!

The Portal Project

Last month we highlighted the brains (and a little brawn) of The Portal Project, with a description of the new regime shift research and the requisite hardware-cloth-battle of 2015.  This month we bring out the big brawn guns (and some brains) to show you how the site keeps its youthful glow year after year in the unforgiving desert.  If we could bottle this Portal magic, it would be a best-seller for sure. Here’s our best attempt:

[Image: Portal crew]

Do you like long walks in the desert?  Do you love cute, cuddly animals?  Do you like to take long walks in the desert with rattlesnakes, scorpions, and tarantulas when it’s 115 degrees out, and do you still love cute, cuddly animals when they bite you and poop on you?  If you answered yes to the last question, you might be Portal Protectors! material. (By the way, Morgan is recruiting another PhD…

PhD Student Opening, Community Ecology, Ernest Lab

PH.D STUDENT OPENING IN COMMUNITY ECOLOGY IN THE ERNEST LAB

The Ernest Lab at the University of Florida has an opening for a Ph.D student in the area of Community Ecology to start fall 2017. The student will be supported as a graduate research assistant as part of an NSF-funded project at the Portal Project, a long-term research site in southeastern Arizona to study regime shifts (rapid shifts in ecosystem structure and function). A short version of the grant proposal can be found here. This position will participate in data collection efforts in Arizona on rodents and plants.

The Ernest lab is interested in general questions about the processes that structure communities, with a particular focus on understanding when and how ecological communities change through time. Students are free to develop their own research projects depending on their interests. Examples of questions in community ecology that students have pursued as part of their dissertation include: Does strong frequency dependence help buffer rare species from stochastic extinctions?, Are biodiversity patterns sensitive to changes in biotic interactions?, and Do disturbances impact species populations and community-level properties similarly?

The Ernest Lab is part of the Weecology research group. Weecology is a partnership between the Ernest Lab, which tends to be more field and community ecology oriented, and the White Lab, which tends to be more quantitatively and computationally oriented. The Weecology group supports and encourages students interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, at teaching-focused colleges, and as postdocs in major research groups. We are also committed to supporting and training a diverse scientific workforce. Current and former group members include people from a variety of racial and ethnic backgrounds from the U.S. and other countries, members of the LGBT community, military veterans, and students who are the first generation in their family to go to college.

Interested students should contact Dr. Morgan Ernest (skmorgane@ufl.edu) by Oct 24th, 2016 with their CV, transcripts (unofficial are perfectly ok),  and a brief statement of research interests.

ALL GOOD THINGS MUST END, OR SHIFT

The next installment from the Portal Blog by my student Joan Meiners (@beecycles) on how we shook things up at the Portal Project so we could study regime shifts.

The Portal Project

We return this week from our special, breaking-news post about the recent reappearance of our one-hit-wonder, Twitter-sensation, spectabulous Banner-tailed Kangaroo Rat. This T-Rex of Portal may not be here to stay, but we’re sure excited she stopped by. What is here to stay is that pesky plot switch we mentioned last month. We’re going to continue our series of Portal science updates and tell you all about that now:


REGIME SHIFTS AND A NEW FRONTIER AT PORTAL

The last time we checked in at this blog prior to the 2015 plot switch, Erica was battling monsoon season to record desert rodent dynamics on the twenty-four long-term experimental plots that have been censused almost monthly since the site was established in 1977 by James Brown, James Reichman, and Diane Davidson. That’s thirty-nine years of tracking the occurrence of various species of small mammals. That’s over four hundred visits to Portal, AZ…


Should Long-term Data be Open Access: Fears, Anecdotes, and Data

From how we do science to publishing practices to the sociology of science, there isn’t an aspect of the scientific endeavor that isn’t in flux right now. Long-term readers of Jabberwocky know that understanding how the scientific endeavor is changing and figuring out how to maximize the good for science and minimize the bad is a bit of an obsession for us. Ethan has been a tireless proponent (or “a member of the radical fringe,” as one former ESA president called him) of changes in scientific publishing and reproducibility. The issue closest to my heart is data availability, which I see as a “for the good of science” issue. By definition, science relies on data. If data is stuck in a drawer and no one knows about it, or it becomes technologically inaccessible (I’m looking at you, 5 1/4” floppy disk), then it effectively does not exist for the scientific endeavor. At best it is knowledge that needs time and resources (neither of which we have in abundance) to reacquire. At worst it is knowledge lost forever.

Fears

But publishing one’s data for the world to use is not the ecology way – in part because extracting data from nature is hard. Much of the resistance is because scientists are afraid they will lose out on credit for that hard work. If they are the ones to publish on that data, they get the credit. If someone else does, the credit is less. Regularly, I see a journal article, or a tweet, or a blog post worried about the increasing push to make data publicly available. Most of these just make me sad for science, but one in particular has haunted me for a while because it focused on something near and dear to my heart: Long-Term Data. A paper published by Mills et al. in TREE last year argued that sharing long-term data will kill off long-term studies. The paper surveyed 73 Principal Investigators running long-term ecology projects. Almost all said they were in favor of sharing data “with the agreement or involvement of the PI”. Only 8% were in favor of “open-access data archiving”1. 91% supported data-sharing when there were clear rules for how/when the data would be shared. Suggestions for rules included “(i) coauthorship or at least acknowledgment, depending on the level of PI involvement; (ii) no overlap with current projects, particularly projects conducted by students or postdoctoral fellows; and (iii) an agreement that the data go no further than the person to whom they are entrusted.”

My colleagues were so against open-access data archiving that many said they would rather submit their science to a less high profile journal if publishing in a more high profile journal required them to archive their data2. The paper argues that this type of decision making will result in less impactful papers and harm the careers and funding opportunities of scientists studying long-term ecology. Respondents feared that flawed science would be produced without the input of the PIs, that being scooped would damage the careers of their trainees, that time would be wasted as multiple labs ran the same analyses, that lower incentives would reduce the number of long-term studies, that collaboration would decline, and that opportunities to obtain new funding would be lost because the research was being done by other groups. You get the idea. The argument is that sharing data results in lost opportunities to author papers, with cascading consequences.

Anecdotes

Having just published the next installment of the Portal Project long-term data and begun our on-line data streaming experiment, it seems like an ideal time to talk about my experiences and concerns with sharing long-term data. After all, unlike many people who have expressed these fears, my raw data3 has been openly available since 2009. How calamitous has this experience been for me and my students?

Since the database was published in 2009, it has been cited 16 times. That’s about 2ish papers a year, not exactly an impressive flurry of activity – though you could argue that it would still be a significant boost to my productivity. But the picture changes when you look at how exactly the data are being used. Of those 16 citations, 4 cite the Data Paper to support statements about long-term data being published, 4 use the data as one of many datapoints as part of a meta-analysis/macroecological study, 3 use the data to plot a data-based example of their idea/concept/tool (i.e. the data is not used as part of an analysis that is being interpreted), 3 use the data to ask a scientific question focused on the field site, 1 cites the data paper for a statement in the metadata about the importance of experimental manipulations and 1 cites it for reasons I cannot ascertain because I couldn’t access the paper but the abstract makes it clear they are not analyzing our data. No one is going to add me as a co-author to cite the existence of long-term data, cite statements made in the metadata, or make an example figure, so we’re down to 7 papers that I “lost” by publishing the data. But it gets even worse4. I am already a co-author on the 3 papers that focus on the site. So, now we’re down to 4 meta-analysis/macroecological studies. As someone who conducts that type of research I can tell you that I would only need to include someone as an author to get access to their data if I desperately needed their particular data for some reason (e.g. location or taxa) or if I couldn’t get enough sites to make a robust conclusion otherwise. There is a lot of data available in the world through a variety of sources (government, literature, etc). Given the number of studies used in those 4 papers, if I had demanded authorship for use of my data, my data would probably not have been included.
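For anyone who wants to check the arithmetic, the tally above can be sketched in a few lines of Python. This is only a bookkeeping aid: the counts are the ones given in the paragraph, while the category labels and variable names are shorthand made up for this sketch and do not come from the data paper.

```python
# Citation counts by type of use, as listed in the paragraph above.
# The labels are hypothetical shorthand, not official categories.
citations = {
    "cites existence of long-term data": 4,
    "meta-analysis / macroecology": 4,
    "example figure only": 3,
    "site-focused question": 3,
    "cites metadata statement": 1,
    "unknown, but not analyzing the data": 1,
}

total = sum(citations.values())

# Only papers that actually analyze the data could count as "lost".
analysis_uses = (
    citations["meta-analysis / macroecology"]
    + citations["site-focused question"]
)

# The site-focused papers already include me as a co-author.
already_coauthor = citations["site-focused question"]
potentially_lost = analysis_uses - already_coauthor

print(f"total citations: {total}")
print(f"papers analyzing the data: {analysis_uses}")
print(f"papers without me as co-author: {potentially_lost}")
```

The last number is the set of meta-analysis/macroecological studies discussed above, which is where any realistic "lost paper" would have to come from.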

Final tally: We published one of the few (and among the longest-term) long-term datasets on climate, plants, and consumers for a single site in existence. This dataset even includes a long-term experimental manipulation (the ‘gold standard’ of ecology). That data has been openly available with no limitations for 7 years. I cannot yet point to an instance where someone used the data in a way that has cost us a paper – either through scooping us or because if the data had not been available they would have been forced to collaborate with me.

In fairness, I don’t expect that to be true forever, but right now our challenge isn’t how to avoid being scooped, it’s how do we get anyone to use our data!!! When I talk about this data online or give talks, invariably people tell me “you are so lucky to have a data set like that”. My response: The entire world has a data set exactly like this because we published it! Not one of those people has published anything with it.

Data

My experience is not unique. A response paper was published this year by Simon Evans examining this lost opportunity cost of making long-term data publicly available. Evans combed through Dryad (the data archival site for many ecology and evolution journals) to see how many times long-term data archived on the site had been used. Defining long-term data as data spanning 5 years or more, there were 67 publicly accessible datasets on Dryad.5 How often had these data been used? Using citations to the data package, examining citations to the original paper associated with the data package, and contacting the data authors to see if they knew of instances of their data being used, Evans found that there were 0 examples of deposited data being reused by investigators who were not authors on the original study.6

Most people I know who are looking for data often forget about Dryad, so maybe Dryad just hasn’t been ‘discovered’. I would be interested to know how Evans’ result compares to data being downloaded from Ecological Archives. But given our experience with the open Portal Project data on Ecological Archives, I suspect differences in long-term data usage between repositories are small.

Positives?

So, currently there is no evidence that publishing long-term data results in the negative impacts predicted by the Mills et al. paper. Does it have a positive impact? For long-term data, I’m currently unclear because so few people seem to be using that type of data. But I have published some macroecological data sets in the past (Ernest 2003, Smith et al 2003, Myhrvold et al 2015) and there have definitely been positives for me. No, they have not resulted in papers for me, but I have also not been scooped on anything I was actively working on or even seriously interested in pursuing. They have resulted in a fair number of citations (397 to date via Google Scholar), they contribute to my h-index (which universities use to judge me), and they have definitely contributed to my name recognition in that area of ecology. (I have been tongue tied on more than one occasion when a big name walked up to me and thanked me for publishing my 2003 mammal life history dataset). No, these aren’t publications, but name recognition is ever harder to obtain in the increasingly crowded field of science, and citation and impact metrics (for better or worse) are increasingly a part of the assessment of scientists. So yes, I believe that publishing datasets in general has been a net positive for me.

Finally, I can also say that NSF is watching. At the end of my previous NSF grant supporting the research at the field site, my program officer was in my ear reminding me that as part of that grant I needed to publish my data. I don’t know how other people feel about this, but I feel that a happy NSF is a positive.

Summary

So, in my experience, publishing my long-term data has not resulted in the grand implosion of my research group. If anything, I think the relative dearth of activity using the long-term data – especially in comparison to the macroecological datasets – suggests that very few people are actually using long-term data. To me, this lack of engagement is much more dangerous for the continuation of funding for long-term ecology than the nebulous fears of open data. If people don’t actively see how important this type of data is, why would they ever recommend for it to be funded? Why prioritize funding long-term data collection – a data type most ecologists have never used and don’t understand – over an experiment, which most ecologists do and understand? We need more advocates for long-term ecology and I don’t believe we can get them by tightly controlling access so only a lucky few can work with the data. So if you’re wondering why we now stream the data nearly live on GitHub, or why we make the data available to be used as a teaching database for DataCarpentry, that’s why. Long-term datasets – and a large body of scientists who understand how to work with them – are going to be important in tackling questions about how and why ecosystems change through time (not just in the past but into the future). This makes increasing the number of people working with long-term data a win for science – and in the long-run I believe it will be a win for me and others who invest so much blood, sweat and tears into generating long-term data.


1 I have never had quantitative evidence before that I was not “normal”. My reaction was to give myself a pat on the back, but I couldn’t figure out if the pat was consolatory or congratulatory.

2 This language is drawn from the original paper and does not reflect my opinions on “high” vs “low” impact journals.

3 In the spirit of complete disclosure, this data is not exactly mine. I’ve collected a lot of it since 1995, but the site was started by Jim Brown, Diane Davidson, and Jim Reichman in 1977 and many people have been involved in collecting the site’s data. But I argue those expressed fears still apply to my case because when Tom Valone and I became the PIs of the project in the early 2000s we could have opted to sit on this treasure trove of data instead of publishing it.

4 Worse depends on your point of view, of course. If you don’t want people to use your data, this is good news. I use worse here in the sense that this is bad news for the argument that publishing your data will cause you to lose papers. It is also worse from the perspective that we published this data and no one is using it.

5 72 data packages in total were identified, but some of these were under embargo and not yet available.

6 This makes Portal look like a rock star! Our raw data (i.e. information had to be extracted using the raw data and could not be obtained from summary statistics in one of our papers) were used in 4 meta-analysis/macroecology papers. That is literally infinitely more use than those other long-term data got :)