[Preprint] Nine simple ways to make it easier to (re)use your data
I’m a big fan of preprints, the posting of papers in public archives prior to peer review. Preprints speed up the scientific dialogue by letting everyone see research as it happens, not 6 months to 2 years later following the sometimes extensive peer review process. They also allow more extensive pre-publication peer review because input can be solicited from the entire community of scientists, not just two or three individuals. You can read more about the value of preprints in our preprint about preprints (yes, really) posted on figshare.
In the spirit of using preprints to facilitate broad pre-publication peer review, a group of weecologists has just posted a preprint on how to make it easier to reuse data that is shared publicly. Since PeerJ’s commenting system isn’t live yet, we would like to encourage you to provide feedback about the paper here in the comments. It’s for a special section of Ideas in Ecology and Evolution on data sharing (something else I’m a big fan of) that is being organized by Karthik Ram (someone I’m a big fan of).
Our nine recommendations are:
- Share your data
- Provide metadata
- Provide an unprocessed form of the data
- Use standard data formats (including file formats, table structures, and cell contents)
- Use good null values
- Make it easy to combine your data with other datasets
- Perform basic quality control
- Use an established repository
- Use an established and liberal license
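To make a couple of these recommendations concrete (standard table structure and good null values), here is a minimal Python sketch. The data, column names, and helper functions are all hypothetical and are not taken from the preprint. It assumes one row per observation and a blank cell as the null value, which is one common choice; the point is only that a blank is never mistaken for a real measurement the way a 0 or -999 can be.

```python
import csv
import io

# Hypothetical example data: one row per observation, blank cell = missing value.
RAW = """site,species,count,mass_g
A,Dipodomys merriami,12,43.5
A,Chaetodipus penicillatus,3,
B,Dipodomys merriami,,41.0
"""


def parse_number(cell):
    """Return a float, or None when the cell is blank (the null value)."""
    cell = cell.strip()
    return float(cell) if cell else None


def load_rows(text):
    """Read CSV text into a list of dicts with the numeric columns converted."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "site": record["site"],
            "species": record["species"],
            "count": parse_number(record["count"]),
            "mass_g": parse_number(record["mass_g"]),
        })
    return rows


rows = load_rows(RAW)

# Missing counts are skipped explicitly instead of silently entering the sum.
known_counts = [r["count"] for r in rows if r["count"] is not None]
total = sum(known_counts)
```

Because the nulls are explicit, anyone reusing the file has to decide how to handle them, rather than discovering later that a placeholder number was averaged into their results.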
Most of this territory has been covered before by a number of folks in the data sharing world, but if you look at the state of most ecological and evolutionary data it clearly bears repeating. In addition, I think that our unique contribution is threefold: 1) We’ve tried hard to stick to relatively simple things that don’t require a huge time commitment to get right; 2) We’ve tried to minimize the jargon and really communicate with the awesome folks who are collecting great data but don’t have much formal background in the best practices of structuring and sharing data; and 3) We contribute the perspective of folks who spend a lot of time working with other people’s data and have therefore encountered many of the most common issues that crop up in ecological and evolutionary data.
So, if you have the time, energy, and inclination, please read the preprint and let us know what you think and what we can do to improve the paper in the comments section.
UPDATE: This manuscript was written in the open on GitHub. You can also feel free to file GitHub issues if that’s more your style.
UPDATE 2: PeerJ has now enabled commenting on preprints, so comments are welcome directly on our preprint as well (https://peerj.com/preprints/7/).
Some alternative advice on how to decide where to submit your paper
Over at Dynamic Ecology this morning Jeremy Fox has a post giving advice on how to decide where to submit a paper. It’s the same basic advice that I received when I started grad school almost 15 years ago and as a result I don’t think it considers some rather significant changes that have happened in academic publishing over the last decade and a half. So, I thought it would be constructive for folks to see an alternative viewpoint. Since this is really a response to Jeremy’s post, not a description of my process, I’m going to use his categories in the same order as the original post and offer my more… youthful… perspective.
- Aim as high as you reasonably can. The crux of Jeremy’s point is “if you’d prefer for more people to read and think highly of your paper, you should aim to publish it in a selective, internationally-leading journal.” From a practical perspective journal reputation used to be quite important. In the days before easy electronic access, good search algorithms, and social networking, most folks found papers by reading the table of contents of individual journals. In addition, before there was easy access to paper level citation data and alt-metrics, if you needed to make a quick judgment on the quality of someone’s science the journal name was a decent starting point. But none of those things are true anymore. I use searches, filtered RSS feeds, Google Scholar’s recommendations, and social media to identify papers I want to read. I do still subscribe to tables of contents via RSS, but I watch PLOS ONE and PeerJ just as closely as Science and Nature. If I’m evaluating a CV as a member of a search committee or a tenure committee I’m interested in the response to your work, not where it is published, so in addition to looking at some of your papers I use citation data and alt-metrics related to your paper. To be sure, there are lots of folks like Jeremy that focus on where you publish to find papers and evaluate CVs, but it’s certainly not all of us.
- Don’t just go by journal prestige; consider “fit”. Again, this used to matter more before there were better ways to find papers of interest.
- How much will it cost? Definitely a valid concern, though my experience has been that waivers are typically easy to obtain. This is certainly true for PLOS ONE.
- How likely is the journal to send your paper out for external review? This is a strong tradeoff against Jeremy’s point about aiming high since “high impact” journals also typically have high pre-review rejection rates. I agree with Jeremy that wasting time in the review process is something to be avoided, but I’ll go into more detail on that below.
- Is the journal open access? I won’t get into the arguments for open access here, but it’s worth noting that increasing numbers of us value open access and think that it is important for science. We value open access publications so if you want us to “think highly of your paper” then putting it where it is OA helps. Open access can also be important if you “prefer for more people to read… your paper” because it makes it easier to actually do so. In contrast to Jeremy, I am more likely to read your paper if it is open access than if it is published in a “top” journal, and here’s why: I can do it easily. Yes, my university has access to all of the top journals in my field, but I often don’t read papers while I’m at work. I typically read papers in little bits of spare time while I’m at home in the morning or evenings, or on my phone or tablet while traveling or waiting for a meeting to start. If I click on a link to your paper and I hit a paywall then I have to decide whether it’s worth the extra effort to go to my library’s website, log in, and then find the paper again through that system. At this point unless the paper is obviously really important to my research the activation energy typically becomes too great (or I simply don’t have that extra couple of minutes) and I stop. This is one reason that my group publishes a lot using Reports in Ecology. It’s a nice compromise between being open access and still being in a well regarded journal.
- Does the journal evaluate papers only on technical soundness? The reason that many of us think this approach has some value is simple: it reduces the amount of time and energy spent trying to get perfectly good research published in the most highly ranked journal possible. This can actually be really important for younger researchers in terms of how many papers they produce at certain critical points in the career process. For example, I would estimate that the average amount of time that my group spends getting a paper into a high profile journal is over a year. This is a combination of submitting to multiple, often equivalent caliber, journals until you get the right roll of the dice on reviewers, and the typically extended rounds of review that are necessary to satisfy the reviewers, not only about what you’ve done, but also by meeting requests for additional analyses that often aren’t critical and changing how things are described so that they sit better with reviewers. If you are finishing your PhD then having two or three papers published in a PLOS ONE style journal vs. in review at a journal that filters on “importance” can make a big difference in the prospect of obtaining a postdoc. Having these same papers out for an extra year accumulating citations can make a big difference when applying for faculty positions or going up for tenure if folks who value paper level metrics over journal name are involved in evaluating your packet.
- Is the journal part of a review cascade? I don’t actually know a lot of journals that do this, but I think it’s a good compromise between aiming high and not wasting a lot of time in review. This is why we think that ESA should have a review cascade to Ecosphere.
- Is it a society journal? I agree that this has value and it’s one of the reasons we continue to support American Naturalist and Ecology even though they aren’t quite as open as I would personally prefer.
- Have you had good experiences with the journal in the past? Sure.
- Is there anyone on the editorial board who’d be a good person to handle your paper? Having a sympathetic editor can certainly increase your chances of acceptance, so if you’re aiming high then having a well matched editor or two to recommend is definitely a benefit.
To be clear, there are still plenty of folks out there who approach the literature in exactly the way Jeremy does and I’m not suggesting that you ignore his advice. In fact, when advising my own students about these things I often actively consider and present Jeremy’s perspective. However, there are also an increasing number of folks who think like I do and who have a very different set of perspectives on these sorts of things. That makes life more difficult when strategizing over where to submit, but the truth is that the most important thing is to do the best science possible and publish it somewhere for the world to see. So, go forth, do interesting things, and don’t worry so much about the details.
UPDATE: More great discussion here, here, here and here. [If I missed yours just let me know in the comments and I’ll add it]
Graduate student opening with Weecology
We’re looking for a new student to join our interdisciplinary research group. The opening is in Ethan’s lab, but the faculty, students, and postdocs in Weecology interact seamlessly among groups. If you’re interested in macroecology, community ecology, or just about anything with a computational/quantitative component to it, we’d love to hear from you. The formal ad is included below (and yes, we did include links to our blog, twitter, and our GitHub repositories in the ad). Please forward this to any students who you think might be a good fit, and let us know if you have any questions.
GRADUATE STUDENT OPENING
The White Lab at Utah State University has an opening for a graduate student with interests in Macroecology, Community Ecology, or Ecological Theory/Modeling. Active areas of research in the White lab include broad scale patterns related to biodiversity, abundance and body size, ecological dynamics, and the use of sensor networks for studying ecological systems. We use computational, mathematical, and advanced statistical methods in much of our work, so students with an interest in these kinds of methods are encouraged to apply. Background in these quantitative techniques is not necessary, only an interest in learning and applying them. While students interested in one of the general areas listed above are preferred, students are encouraged to develop their own research projects related to their interests. The White Lab is part of an interdisciplinary ecology research group (http://weecology.org) whose goal is to facilitate the broad training of ecologists in areas from field work to quantitative methods. Students with broad interests are jointly trained in an interdisciplinary setting. We are looking for students who want a supportive environment in which to pursue their own ideas. Graduate students are funded through a combination of research assistantships, teaching assistantships, and fellowships. Students interested in pursuing a PhD are preferred. Utah State University has an excellent graduate program in ecology with over 50 faculty and 80+ graduate students across campus affiliated with the USU Ecology Center (http://www.usu.edu/ecology/).
Additional information about the position and Utah State University is available at:
http://whitelab.weecology.org/grad-student-opening
Interested students can find more information about our group by checking out:
Our websites: http://whitelab.weecology.org, http://weecology.org
Our code repositories: http://github.com/weecology
Our blog: http://jabberwocky.weecology.org
And Twitter: http://twitter.com/ethanwhite
Interested students should contact Dr. Ethan White (ethan.white@usu.edu) by December 1st, 2012 with their CV, GPA, GRE scores (if available), and a brief statement of research interests.
ESA journals do not allow papers with preprints
Over the weekend I saw this great tweet:
by Philippe Desjardins-Proulx and was pleased to see yet another actively open young scientist. Then I saw his follow up tweet:
At first I was confused. I thought ESA’s policy was that preprints were allowed based on the following text on their website (emphasis mine: still available in Google’s Cache):
A posting of a manuscript or thesis on a personal or institutional homepage or ftp site will generally be considered as a preprint; this will not be grounds for viewing the manuscript as published. Similarly, posting of manuscripts in public preprint archives or in an institution’s public archive of unpublished theses will not be considered grounds for declaring a manuscript published. If a manuscript is available as part of a digital publication such as a journal, technical series or some other entity to which a library can subscribe (especially if that publication has an ISSN or ISBN), we will consider that the manuscript has been published and is thus not eligible for consideration by our journals. A partial test for prior publication is whether the manuscript has appeared in some entity with archival value so that it is permanently available to reasonably diligent scholars. A necessary test for prior publication is whether the author can legally transfer copyright to ESA.
So I asked Philippe to explain his tweet:
This got me a little riled up so I broadcast my displeasure:
And then Jarrett Byrnes questioned where this was coming from given the stated policy:
So I emailed ESA to check and, sure enough, preprints on arXiv and similar preprint servers are considered prior publication and therefore cannot be submitted to ESA journals, despite the fact that this isn’t a problem for a few journals you may have heard of including Science, Nature, PNAS, and PLoS Biology. ESA (to their credit) has now clarified this point on their website (emphasis mine; thanks to Jaime Ashander for the heads up):
A posting of a manuscript or thesis on an author’s personal or home institution’s website or ftp site generally will not be considered previous publication. Similarly posting of a “working paper” in an institutional repository is allowed so long as at least one of the authors is affiliated with that institution. However, if a manuscript is available as part of a digital publication such as a journal, technical series, or some other entity to which a library can subscribe (especially if that publication has an ISSN or ISBN), we will consider that the manuscript has been published and is thus not eligible for consideration by our journals. Likewise, if a manuscript is posted in a citable public archive outside the author’s home institution, then we consider the paper to be self-published and ineligible for submission to ESA journals. Finally, a necessary test for prior publication is whether the author can legally transfer copyright to ESA.
In my opinion the idea that a preprint is “self-published” and therefore represents prior publication is poorly justified* and not in the best interests of science, and I’m not the only one:
So now I’m hoping that Jarrett is right:
and that things might change (and hopefully soon). If you know someone on the ESA board, please point them in the direction of this post.
UPDATE: Just as I was finishing working on this post ESA responded to the tweet stream from the last few days:
I’m very excited that ESA is reviewing their policies in this area. As I should have said in the original post, I have, up until this year, been quite impressed with ESA’s generally open, and certainly pro-science policies. This last year or so has been a bad one, but I’m hoping that’s just a lag in adjusting to the new era in scientific publishing.
UPDATE 2: ESA has announced that they have changed their policy and will now consider articles with preprints.
———————————————————————————————————————————————————————–
*I asked ESA if they wanted to clarify their justification for this policy and haven’t heard back (though it has been less than 2 days). If they get back to me I’ll update or add a new post.
Characterizing the species-abundance distribution with only information on richness and total abundance [Research Summary]
This is the first of a new category of posts here at Jabberwocky Ecology called Research Summaries. We like the idea of communicating our research more broadly than to the small number of folks who have the time, energy, and interest to read through entire papers. So, for every paper that we publish we will (hopefully) also do a blog post communicating the basic idea in a manner targeted towards a more general audience. As a result these posts will intentionally skip over a lot of detail (technical and otherwise), and will intentionally use language that is less precise, in order to communicate more broadly. We suspect that it will take us quite a while to figure out how to do this well. Feedback is certainly welcome.
This is a Research Summary of: White, E.P., K.M. Thibault, and X. Xiao. 2012. Characterizing species-abundance distributions across taxa and ecosystems using a simple maximum entropy model. Ecology. http://dx.doi.org/10.1890/11-2177.1*
The species-abundance distribution describes the number of species with different numbers of individuals. It is well known that within an ecological community most species are relatively rare and only a few species are common, and understanding the detailed form of this distribution of individuals among species has been of interest in ecology for decades. This distribution is considered interesting both because it is a complete characterization of the commonness and rarity of species and because the distribution can be used to test and parameterize ecological models.
Numerous mathematical descriptions of this distribution have been proposed and much of the research into this pattern has focused on trying to figure out which of these descriptions is “the best” for a particular group of species at a small number of sites. We took an alternative approach to this pattern and asked: Can we explain broad scale, cross-taxonomic patterns in the general shape of the abundance distribution using a simple model that requires only knowledge of the species richness and total abundance (summed across all species) at a site?
To do this we used a model that basically describes the most likely form of the distribution if the average number of individuals in a species is fixed (which turns out to be a slightly modified version of the classic log-series distribution; see the paper or John Harte’s new book for details). As a result this model involves no detailed biological processes and if we know richness and total abundance we can predict the abundance of each species in the community (i.e., the abundance of the most common species, second most common species… rarest species).
Since we wanted to know how well this works in general (not how well it works for birds in Utah or trees in Panama) we put together a dataset of more than 15,000 communities. We did this by combining 6 major datasets that are either citizen science, big government efforts, or compilations from the literature. This compilation includes data on birds, trees, mammals, and butterflies. So, while we’re missing the microbes and aquatic species, I think that we can be pretty confident that we have an idea of the general pattern.
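For readers who like to see the idea in code, here is a rough Python sketch of this kind of prediction. It is not the code from the paper. It assumes the classic, untruncated log-series rather than the slightly modified version the paper uses, it picks the log-series parameter so that the mean abundance per species equals total abundance divided by richness, and it reads predicted abundances off the cumulative distribution at the quantiles (rank - 0.5) / richness, which is one common convention and an assumption here. All function names are made up for illustration.

```python
import math


def logseries_mean(p):
    """Mean of the log-series distribution with parameter p (0 < p < 1)."""
    return -p / ((1.0 - p) * math.log1p(-p))


def fit_p(richness, total_abundance, tol=1e-12):
    """Find p so the log-series mean equals total_abundance / richness (bisection)."""
    target = total_abundance / richness
    if target <= 1.0:
        raise ValueError("need more individuals than species")
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The mean increases with p, so move the bracket toward the target.
        if logseries_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def predicted_rad(richness, total_abundance):
    """Predicted abundances, most common species first (assumed quantile convention)."""
    p = fit_p(richness, total_abundance)
    norm = -1.0 / math.log1p(-p)
    abundances = []
    n, cdf = 0, 0.0
    for rank in range(1, richness + 1):
        q = (rank - 0.5) / richness
        # Walk up the CDF until it reaches this quantile; cap at the total
        # abundance, since no species can have more individuals than that.
        while cdf < q and n < total_abundance:
            n += 1
            cdf += norm * p ** n / n
        abundances.append(n)
    return sorted(abundances, reverse=True)


# Hypothetical community: 10 species and 100 individuals.
example_rad = predicted_rad(10, 100)
```

The only inputs are richness and total abundance, which is the whole point: nothing about the biology of the particular species enters the prediction.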
In general, we can do an excellent job of predicting the abundance of each rank of species (most abundant, second most abundant…) at each site using only information on the species richness and total abundance at the site. Here is a plot of the observed number of individuals in a given rank at a given site against the number predicted. The plot is for Breeding Bird Survey data, but the rest of the datasets produce similar results.

Observed-predicted plot for nearly 3000 Breeding Bird Survey communities. Since there are over 100,000 points on this plot we’ve color coded them by the number of points in the vicinity of the focal point, so red areas have lots of points nearby and blue areas have very few points. The black line is the 1:1 line.
The model isn’t perfect of course (they never are and we highlight some of its failures in the paper), but it means that if we know the richness and total abundance of a site then we can capture over 90% of the variation in the form of the species-abundance distribution across ecosystems and taxonomic groups.
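The post does not spell out how the "variation captured" figure is computed, so the following Python sketch is only one plausible reading: a coefficient of determination measured around the 1:1 line, calculated on log10-transformed abundances. Both the log transform and the function name are assumptions made for illustration.

```python
import math


def r2_one_to_one(observed, predicted):
    """Coefficient of determination around the 1:1 line on log10 abundances."""
    obs = [math.log10(x) for x in observed]
    pred = [math.log10(x) for x in predicted]
    mean_obs = sum(obs) / len(obs)
    # Residuals are measured from the 1:1 line, not from a fitted regression.
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot
```

Unlike the R² from a fitted regression, this value can be negative when predictions are worse than simply guessing the mean observed abundance, so a high value means the predictions themselves, not a rescaled version of them, track the data.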
This result is interesting for two reasons:
First, it suggests that the species-abundance distribution, on its own, doesn’t tell us much about the detailed biological processes structuring a community. Ecologists have known that it wasn’t fully sufficient for distinguishing between different models for a while (though we didn’t always act like it), but our results suggest that in fact there is very little additional information in the distribution beyond knowing the species richness and total abundance. As such, any model that yields reasonable richness and total abundance values will probably produce a reasonable species-abundance distribution.
Second, this means that we can potentially predict the full distribution of commonness and rarity even at locations we have never visited. This is possible because richness and total abundance can, at least sometimes, be well predicted using remotely sensed data. These predictions could then be combined with this model of the species-abundance distribution to make predictions for things like the number of rare species at a site. In general, we’re interested in figuring out how much ecological pattern and process can be effectively characterized and predicted at large spatial scales, and this research helps expand that ability.
So, that’s the end of our first Research Summary. I hope it’s a useful thing that folks get something out of. In addition to the science in this paper, I’m also really excited about the process that we used to accomplish this research and to make it as reproducible as possible. So, stay tuned for some follow up posts on big data in ecology, collaborative code development, and making ecological research more reproducible.
———————————————————————————————————————————————————————————————
*The paper will be Open Access once it is officially published but, for reasons that don’t make a lot of sense to me, it is behind a paywall until it comes out in print.
On the value of fundamental scientific research
Jeremy Fox over at the Oikos Blog has written an excellent piece explaining why fundamental, basic science, research is worth investing in, even when time and resources are limited. His central points include:
- Fundamental research is where a lot of our methodological advances come from.
- Fundamental research provides generally-applicable insights.
- Current applied research often relies on past fundamental research.
- Fundamental research often is relevant to the solution of many different problems, but in diffuse and indirect ways.
- Fundamental research lets us address newly-relevant issues.
- Fundamental research alerts us to relevant questions and possibilities we didn’t recognize as relevant.
- Fundamental research suggests novel solutions to practical problems.
- The only way to train fundamental researchers is to fund fundamental research.
I don’t have a lot to add to what Jeremy has already said, except that I strongly agree with the points that he has made and think that in an era where much of ecology has direct applications to things like global change we need to guard against the temptation to justify all of our research based on its applications.
When I think about the value of fundamental research I always recall a scene from an early season of The West Wing where a politician (SAM) and a scientist (MILLGATE) are discussing how to explain the importance of something akin to the Large Hadron Collider. It loses a little something as a script (compliments of Unofficial West Wing Transcript Archive), but nonetheless:
SAM
What is it?
MILLGATE
It’s a machine that reveals the origin of matter… By smashing protons together at very high speeds and at very high temperatures, we can recreate the Big Bang in a laboratory setting, creating the kinds of particles that only existed in the first trillionth of a second after the universe was created.
SAM
Okay, terrific. I understand that. What kind of practical applications does it have?
MILLGATE
None at all.
SAM
You’re not in any way a helpful person.
MILLGATE
Don’t have to be. I have tenure.
SAM
Doctor.
MILLGATE
There are no practical applications, Sam. Anybody who says different is lying.
…
ENLOW
If only we could only say what benefit this thing has, but no one’s been able to do that.
MILLGATE
That’s because great achievement has no road map. The X-ray’s pretty good. So is penicillin. Neither were discovered with a practical objective in mind. I mean, when the electron was discovered in 1897, it was useless. And now, we have an entire world run by electronics. Haydn and Mozart never studied the classics. They couldn’t. They invented them.
SAM
Discovery.
MILLGATE
What?
SAM
That’s the thing that you were… Discovery is what. That’s what this is used for. It’s for discovery.
The episode is “Dead Irish Writers” and I’d highly recommend watching the whole thing if you want to feel inspired about doing fundamental research.
Why I will no longer review for your journal
I have, for a while, been frustrated and annoyed by the behavior of several of the large for-profit publishers. I understand that their motivations are different from my own, but I’ve always felt that an industry that relies entirely on both large amounts of federal funding (to pay scientists to do the research and write up the results) and a massive volunteer effort to conduct peer review (the scientists again) needed to strike a balance between the needs of the folks doing all of the work and the corporations’ need to maximize profits.
Despite my concerns about the impacts of increasingly closed journals, with increasingly high costs, on the dissemination of research and the ability of universities to support their core missions of teaching and research, I have continued to volunteer my time and effort as a reviewer to Elsevier and Wiley-Blackwell. I did this because I have continued to see valuable contributions made by these journals and I felt that this combined with the contribution that I was making to science by helping improve the science published in high profile places made supporting these journals worthwhile. I no longer believe this to be the case and from now on I will no longer be reviewing for any journal that is published by Elsevier, Springer, or Wiley-Blackwell (including society journals that publish through them).
Why have I changed my mind? Because of the pursuit/support by these companies of the Research Works Act. This act seeks to prevent funding agencies from requiring that the results of research that they funded be made publicly available. In other words it seeks to prevent the government (and the taxpayers that fund it), which pays for a very large fraction of the cost of any given paper through both funding the research and paying the salaries of reviewers and editors, from having any say in how that research is disseminated. I think that Mike Taylor in the Guardian said most clearly how I feel about this attempt to exert legislative control requiring us to support corporate profits over the dissemination of scientific research:
Academic publishers have become the enemies of science
This is the moment academic publishers gave up all pretence of being on the side of scientists. Their rhetoric has traditionally been of partnering with scientists, but the truth is that for some time now scientific publishers have been anti-science and anti-publication. The Research Works Act, introduced in the US Congress on 16 December, amounts to a declaration of war by the publishers.
You should read the entire article. It’s powerful. There are lots of other great articles about the RWA including Michael Eisen in the New York Times, a nice post by INNGE, and an interesting piece by Paul Krugman (via oikosjeremy). I’m also late to the party in declaring my peer review strike and less eloquent than many of my peers in explaining why (see great posts by Michael Taylor, Gavin Simpson, and Timothy Gowers). But I’m here now and I’m letting you know so that you can consider whether or not you also want to stop volunteering for companies that don’t have science’s best interests in mind.
If you’d like to read up on the publishers’ side of this argument (they have costs, they have a right to recoup them) you can see Springer’s official position or an Elsevier Exec’s exchange with Michael Eisen. My problem with all of these arguments is that there is nothing in any funding agency’s policy that requires publishers to publish work funded by that agency. This is not (as Springer has argued) an “unfunded mandate”; this is a stakeholder that has certain requirements related to the publication of research in which they have an interest. This is just like an author (in any non-academic publishing situation) negotiating with a publisher. If the publisher doesn’t like the terms that the author demands, then they don’t have to publish the book. Likewise, if a publisher doesn’t like the NIH policy then they should simply not agree to publish NIH funded research.
To be clear, I am not as extreme in my position as some. I still support and will review for independent society journals like Ecology and American Naturalist even though they aren’t Open Access and even though ESA has made some absurd comments in support of the same ideas that are in RWA. The important thing for me is that these journals have the best interests of science in mind, even if they are often frustratingly behind the times in how they think and operate.
And don’t worry, I’ve still got plenty of journal related work to keep me busy, thanks to my new position on the editorial board at PLoS ONE.
UPDATE: The links to the INNGE and Timothy Gowers post have now been fixed, and here are links to a couple of great posts by Casey Bergman that I somehow left out: one on how to turn down reviews while making a point and one on the not so positive response he received to one of these emails.
UPDATE 2: A great collection of posts on RWA. There are a lot of really unhappy scientists out there.
UPDATE 3: A formal Boycott of Elsevier. Almost 1000 scientists have signed on so far.
UPDATE 4: Wiley-Blackwell has now distanced itself from RWA and said that “We do not believe that legislative initiatives are the best way forward at this time and so have no plans to endorse RWA. Instead we believe that research funder-publisher partnerships will be more productive.” In addition, it was announced that a bill that would do the opposite of RWA has now been introduced. Hooray for collective action!
A new database for mammalian community ecology and macroecology
There are a number of great datasets available for doing macroecology and community ecology at broad spatial scales. These include data on birds (Breeding Bird Survey, Christmas Bird Count), plants (Forest Inventory & Analysis, Gentry’s transects), and insects (North American Butterfly Association Counts). However, if you wanted to do work that relied on knowing the presence or abundance of individuals at particular sites (i.e., you’re looking for something other than range maps) there has never been a decent dataset to work with for mammals.
Announcing the Mammal Community Database (MCDB)
Over the past couple of years we’ve been working to fill that gap as best we could. Since coordinated continental scale surveys of mammals don’t yet exist [1], we dug into the extensive mammalogy literature and compiled a database of 1000 globally distributed communities. Thanks to Kate Thibault’s leadership and the hard work of Sarah Supp and Mikaelle Giffen, we are happy to announce that this data is now freely available as a data paper on Ecological Archives.
In addition to containing species lists for 1000 locales, there is abundance data for 940 of the locations, some site level body size data (~50 sites) and a handful of reasonably long (> 10 yr) time-series as well. Most of the data is restricted to the particular mode of sampling that an individual mammalogist uses and as a result much of the data is for small mammals captured in Sherman traps.
Working with data compilations like this is always difficult because the differences in sampling intensity and approaches between studies can make it very difficult to compare data across sites. We’ve put together a detailed table of information on how sampling was conducted to help folks break the data into comparable subsets and/or attempt to control for the influence of sampling differences in their statistical models.
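As a rough illustration of what breaking the data into comparable subsets might look like, here is a minimal Python sketch. The records, field names (`site_id`, `trap_type`, `trap_nights`), and the effort threshold are all made up for the example; they are assumptions and do not reflect the actual structure of the MCDB sampling table.

```python
from collections import defaultdict

# Hypothetical site records. The field names and values are invented for
# illustration and do not match the actual MCDB tables.
sites = [
    {"site_id": 1, "trap_type": "Sherman", "trap_nights": 1200},
    {"site_id": 2, "trap_type": "Sherman", "trap_nights": 300},
    {"site_id": 3, "trap_type": "pitfall", "trap_nights": 900},
    {"site_id": 4, "trap_type": "Sherman", "trap_nights": 2500},
]


def comparable_subsets(records, min_effort):
    """Group site IDs by trap type, keeping only sites whose sampling
    effort (trap nights) is at least min_effort."""
    subsets = defaultdict(list)
    for rec in records:
        if rec["trap_nights"] >= min_effort:
            subsets[rec["trap_type"]].append(rec["site_id"])
    return dict(subsets)


# Keep sites with at least 500 trap nights, grouped by sampling method.
subsets = comparable_subsets(sites, min_effort=500)
print(subsets)
```

Grouping by method and dropping low-effort sites is only one option. The same sampling information could instead be kept as a covariate in a statistical model rather than used as a hard filter.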
The joys of Open Science
We’ve been gradually working on making the science that we do at Weecology more and more open, and the MCDB is an example of that. We submitted the database to Ecological Archives before we had actually done much of anything with it ourselves [2], because the main point of collecting the data was to provide a broadly useful resource to the ecological community, not to answer a specific question. We were really excited to see that as soon as we announced it on Twitter (https://twitter.com/weecology/status/152158777385295872) folks started picking it up and doing cool things with it [3]. We hope that folks will find all sorts of uses for it going forward.
Going forward
We know that there is tons more data out there on mammal communities. Some of it is unpublished, or not published in enough detail for us to include. Some of it has licenses that mean that we can’t add it to the MCDB without special permission (e.g., there is a lot of great LTER mammal data out there). Lots of it we just didn’t find while searching through the literature.
If folks know of more data we’d love to hear about it. If you can give us permission to add data that has more restrictive licensing then we’d love to do so [4]. If you’re interested in collaborating on growing the database let us know. If there’s enough interest we can invest some time in developing a public portal.
The footnotes [5]
[1] We are anxiously awaiting NEON’s upcoming surveys, headed up by former Weecology postdoc Kate Thibault.
[2] We have a single paper that is currently in review that uses the data.
[3] Thanks to Scott Chamberlain and Markus Gesmann. You guys are awesome!
[4] To be clear, we haven’t been asking for permission yet, so no one has turned us down. We wanted to get the first round of data collection done first to show that this was a serious effort.
[5] Because anything that David Foster Wallace loved has to be a good thing.
Weecology at ESA
If folks are interested in seeing what Weecology has been up to lately we have a bunch of posters and talks at ESA this year. In order of appearance:
- Tuesday at 2:30 pm in Room 9AB our new postdoctoral researcher Dan McGlinn will be giving a talk on looking at community assembly using patterns of within- and between-species spatial variation.
- Tuesday afternoon at poster #28 Morgan will be presenting research on how the long-term community dynamics of the plant and rodent communities near Portal, AZ are related to decadal scale climate cycles. She’ll be there from 4:30 to 6:30 to chat, or stop by any time and take a look.
- Wednesday at 1:50 pm in Room 19A one of our new members, Elita Baldridge, will be giving a talk on her master’s research on nested subsets.
- Wednesday at poster #139 Ethan will be presenting on our two attempts to make it easier to find and use ecological data. He’ll be there from 4:30 to 6:30 to chat, or stop by any time and take a look (or grab a computer and check out EcologicalData and the EcoData Retriever).
- Thursday at 1:50 pm in Room 10A another of our new members, Zack Brym, will be giving a talk on his master’s research on controls on the invasion of an exotic shrub.
- Thursday at 4 pm in Room 8 Sarah Supp will give a talk on her work looking at the impacts of experimental manipulations on macroecological patterns (highlighted as a talk to see by the Oikos blog).
- And last, but certainly not least, bright and early Friday morning at 8 am in Room 8 Kate Thibault (who has now moved on to fame and fortune at NEON) will be presenting on our work using maximum entropy models to predict the species abundance distributions of 16,000 communities.
Enjoy!
Some meandering thoughts on the difference between EcologicalData.org and DataONE
In the comments of my post on the Ecological Data Wiki Jarrett Byrnes asked an excellent question:
Very cool. I’m curious, how do you think this will compare/contrast/fight with the Data One project – https://www.dataone.org/ – or is this a different beast altogether?
As I started to answer it I realized that my thoughts on the matter were better served by a full post, both because they are a bit lengthy and because I don’t actually know much about DataONE and would love to have some of their folks come by, correct my mistaken impressions, and just chat about this stuff in general.
To begin with I should say that I’m still trying to figure this out myself, both because I’m still figuring out exactly what DataONE is going to be, and because EcologicalData is still evolving. I think that both projects’ goals could be largely defined as “Organizing Ecology’s Data,” but that’s a pretty difficult task, involving a lot of components and a lot of different ways to tackle them. So, my general perspective is that the more folks we have trying the merrier. I suspect there will be plenty of room for multiple related projects, but I’d be just as happy (even happier probably) if we could eventually find a single centralized location for handling all of this. All I want is a solution to the challenge.
But, to get to the question at hand, here are the differences I see based on my current understanding of DataONE:
1. Approach. There are currently two major paradigms for organizing large amounts of information. The first is to figure out a way to tell computers how to do it for us (e.g., Google); the second is to crowdsource its development and curation (e.g., Wikipedia). DataONE is taking the computer-based approach. It’s heavy on metadata, ontologies, etc. The goal is to manage the complexities of ecological data by providing the computer with very detailed descriptions of the data that it can understand. We’re taking the human approach, keeping things simple and trying to leverage the collective knowledge and effort of the field. As part of this difference in approach I suspect that EcologicalData will be much more interactive and community driven (the goal is for the community to actually run the site, just like Wikipedia) whereas DataONE will tend to be more centralized and hierarchical. I honestly couldn’t tell you which will turn out better (perhaps the two approaches will each turn out to be better for different things) but I’m really glad that we’re trying both at the same time to figure out what will work and where their relative strengths might be.
2. Actually serving data. DataONE will do this; we won’t. This is part of the difference in approach. If the computer can handle all of the thinking with respect to the data then you want it to do that and just spit out what you want. Centralizing the distribution of heterogeneous data is a really complicated task and I’m excited the folks at DataONE are tackling the challenge.
a. One of the other challenges for serving data is that you have to get all of the folks who “own” the data to let you provide it. This is one of the reasons I came up with the Data Wiki idea. By serving as a portal it helps circumvent the challenges of getting all of the individual stakeholders to agree to participate.
b. We do provide a tool for data acquisition, the EcoData Retriever, that likewise focuses on circumventing the need to negotiate with data providers by allowing each individual investigator to automatically download the data from the source. But, this just sets up each dataset independently, whereas I’m presuming that DataONE will let you just run one big query of all the data (which I’m totally looking forward to by the way) [1].
3. Focus. The primary motivation behind the Data Wiki goes beyond identifying datasets and really focuses on how you should use them. Having worked with other folks’ data for a number of years I can say that the biggest challenge (for me anyway) is actually figuring out all of the details of when and how the dataset should be used. This isn’t just a question of reading metadata either. It’s a question of integrating thoughts and approaches from across the literature. What I would like to see develop on the Data Wiki pages is a set of concise descriptions of how to go about using these datasets in the best way possible. This is a very difficult task to automate and one where I think a crowdsourced solution is likely the most effective. We haven’t done a great job of this yet, but Allen Hurlbert and I have some plans to develop a couple of good examples early in the fall to help demonstrate the idea.
4. We’re open for business. Ha ha, eat our dust DataONE. But seriously, we’ve taken a super simple approach which means we can get up and running quickly. DataONE is doing something much more complicated and so things may take some time to roll out. I’m hoping to get a better idea of what their timelines look like at ESA. I’m sure their tools will be well worth the wait.
5. Oh, and their budget is a little over $2,000,000/year, which is just slightly larger than our budget of around $5,000/year.
So, there is my lengthy and meandering response to Jarrett’s question. I’m looking forward to chatting with DataONE folks at ESA to find out more about what they are up to, and I’d love to have them stop by here to chat and clear up my presumably numerous misconceptions.
——————————————————————————————————————-
[1] Though we do have some ideas for managing something somewhat similar, so stay tuned for EcoData Retriever 2.0. Hopefully coming to an internet near you sometime this spring.
Distributed Ecology [Blogrolling]
I’ve been waiting for a while now for Ted Hart’s blog to get up enough steam to send folks over there, and since in the last two weeks he’s had three posts, revamped the mission of the blog, and engaged in the ongoing conversation about Lindenmayer & Likens, it seems like that time has arrived.
The blog is called Distributed Ecology because, as Ted describes,
I chose distributed ecology as a title because I like the idea of ecological thought like distributed computing. Lots of us scientists like little nodes around the web thinking and processing ideas into something great.
Sounds like what I’m hoping to see (and am increasingly witnessing) from the ecology blogs. So, head on over, check it out, click on the RSS button, and welcome Ted to the ecology blogging community.
A Plea for Pluralism
As you may have seen earlier either on Jabberwocky, EEB and Flow, or over at Oikos’ new blog, the most recent piece about how some branch of ecology is ruining ecology has caused some discussion in the blogosphere. Every time one of these comes out, I tell myself I’m going to write a blog post but then I think, “that’s just one cranky person,” and I get distracted doing science that is killing ecology (given the plethora of opinions about what is ruining our field, odds are you too are killing ecology, regardless of what type of science you do). But as these opinion pieces keep emerging, I have increasingly come to feel that these debates on the ‘best’ approach reflect a very limited view of the scientific endeavor. Every approach (field ecology, microcosms, theory, meta-analysis, macroecology, insert your favorite approach that I’ve missed here) is fundamentally limited in its scope, focus, and ability to divine answers from nature, yet has unique strengths in what it allows us to do. Theory is abstracted from nature, but can also provide a concrete set of expectations and processes for empiricists to work with. Microcosms, while similarly critiqued for their abstraction from reality, can also give the clearest indication about whether ideas and theories work (or don’t) under the most ideal scenarios. Field ecology (particularly experimental manipulation) has been considered the gold standard for its ability to show cause and effect in ‘real’ ecosystems, but it is also messy, expensive, and time-consuming (I say this thinking of my own field site, perhaps yours is less so), and in a natural setting it is impossible to have control over all of the important (and potentially confounding) variables. Macroecology and meta-analysis allow us to step back from individual systems and taxa to ask whether patterns and processes are general across nature, general within certain subsets of systems, or unpredictably important (and unimportant).
However, they lack the ability to manipulate nature directly to tease out cause and effect more definitively. Because all approaches have limitations, the exclusive use of any one approach is guaranteed to give us a limited and possibly flawed view of reality. In the scientific utopia that lives in my head, these different approaches to addressing scientific questions live together harmoniously: results from one approach generate questions best addressed with another approach, and the cumulative evidence from all approaches gives us a more complete understanding of nature. When I read opinion pieces that advocate for a particular approach above all others, I worry that this utopia only exists in my head. After all, those opinion pieces never seem to be balanced by a counter argument for plurality. But then sometimes I read things – often on the internet – and I think: it may be in my head, but maybe my head is not the only one that dream resides in.