Author Archives: Ethan White
EcoData Retriever: quickly download and clean up ecological data so you can get back to doing science
If you’ve ever worked with scientific data, your own or someone else’s, you know that you can end up spending a lot of time just cleaning up the data and getting it into a state that makes it ready for analysis. This involves everything from cleaning up non-standard null values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with them. Doing this for one dataset can be a lot of work, and if you work with a number of different databases like I do, the time and energy can really take away from the time you have to actually do science.
Over the last few years Ben Morris and I have been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With a click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary) and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.
Just click on the box to get the data:
Or run a command like this from the command line:
retriever install msaccess BBS --file myaccessdb.accdb
This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up to date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.
The Retriever handles things like:
- Creating the underlying database structures
- Automatically determining delimiters and data types
- Downloading the data (and if there are over 100 data files that can be a lot of clicks)
- Transforming data into standard structures so that common tools in R and Python and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
- Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones
- Combining multiple data files into single tables
- Placing all related tables in a single database or schema
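Two of the steps above, converting cross-tabulated data and standardizing null values, can be sketched in a few lines of Python. This is only a minimal illustration of the idea, not the Retriever's actual code; the function names, column names, and example data are all made up for the example.

```python
# Illustrative sketch only: this is NOT the Retriever's implementation.
# The null markers are the examples mentioned in the list above.
NULL_VALUES = {"999.0", "-999", "NoData"}


def clean_value(value):
    """Return None for a non-standard null, otherwise the stripped value."""
    value = value.strip()
    return None if value in NULL_VALUES else value


def crosstab_to_long(rows, id_column, variable_name, value_name):
    """Turn one-column-per-category rows into one row per (id, category)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: clean_value(value),
            })
    return long_rows


# Hypothetical cross-tabulated counts: one column per species.
crosstab = [
    {"site": "A", "sparrow": "12", "wren": "-999"},
    {"site": "B", "sparrow": "NoData", "wren": "3"},
]

long_table = crosstab_to_long(crosstab, "site", "species", "count")
for record in long_table:
    print(record)
```

Each site/species pair ends up on its own row, and the non-standard nulls become a single standard null (None here), which is the shape that most database and analysis tools expect.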
The EcoData Retriever currently includes a number of large, openly available, ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables a Retriever script can be as simple as:
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
We also have some exciting new features on the To Do list including:
- Automatically cleaning up the taxonomy using existing services
- Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
- Integration into R and Python
Let us know what you think we should work on next in the comments.
This is a guest post by Elita Baldridge (@elitabaldridge). She is a graduate student in our group who has been navigating the development of a chronic illness during graduate school. She is sharing her story to help spread awareness of the challenges faced by graduate students with chronic illnesses. She wrote an excellent post on the PhDisabled blog about the initial development of her illness that I encourage you to read first.
During my time as a Ph.D. student, I developed a host of bizarre, productivity eating symptoms, and have been trying to make progress on my dissertation while also spending a lot of time at doctors’ offices trying to figure out what is wrong with me. I wrote an earlier blog post about dealing with the development of a chronic illness as a graduate student at the PhDisabled Blog.
When the rheumatologist handed me a yellow pamphlet labeled “Fibromyalgia”, I felt a great sense of relief. My mystery illness had a diagnosis, so I had a better idea of what to expect. While chronic, at least fibromyalgia isn’t doing any permanent damage to joints or brain. However, there isn’t a lot known about it, the treatment options are limited, and the primary literature is full of appallingly small sample sizes.
There are many symptoms, which basically consist of feeling like you have the flu all the time, with all the associated aches and pains. The worst one for me, because it interferes with my highly prized ability to think, is the cognitive dysfunction, or, in common parlance, “fibro fog”. This is a problem when you are actively trying to get research done, as sometimes you remember what you need to do, but can’t quite figure out how navigating to your files in your computer works, what to do with the mouse, or how to get the computer on. I frequently finish sentences with a wave of my hand and the word “thingy”. Sometimes I cannot do simple math, as I do not know what the numbers mean, or what to do next. Depending on the severity, the cognitive dysfunction can render me unable to work on my dissertation as I simply cannot understand what I am supposed to do. I’m not able to drive anymore, due to the general fogginess, but I never liked driving that much anyway. Sometimes I need a cane, because my balance is off or I cannot walk in a straight line, and I need the extra help. Sometimes I can’t be in a vertical position, because verticality renders me so dizzy that I vomit.
I am actually doing really well for a fibromyalgia patient. I know this, because the rheumatologist who diagnosed me told me that I was doing remarkably well. I am both smug that I am doing better than average, because I’m competitive that way, and also slightly disappointed that this level of functioning is the new good. I would have been more disappointed, only I had a decent amount of time to get used to the idea that whatever was going on was chronic and “good” was going to need to be redefined. My primary care doctor had already found a medication that relieved the aches and pains before I got an official diagnosis. Thus, before receiving an official diagnosis, I was already doing pretty much everything that can be done medication wise, and I had already figured out coping mechanisms for the rest of it. I keep to a strict sleep schedule, which I’ve always done anyway, and I’ve continued exercising, which is really important in reducing the impact of fibromyalgia. I should be able to work up my exercise slowly so that I can start riding my bicycle short distances again, but the long 50+ mile rides I used to do are probably out.
Fortunately, my research interests have always been well suited to a macroecological approach, which leaves me well able to do science when my brain is functioning well enough. I can test my questions without having to collect data from the field or lab, and it’s easy to do all the work I need to from home. My work station is set up right by the couch, so I can lie down and rest when I need to. I have to be careful to take frequent breaks, lest working too long in one position cause a flare up. This is much easier than going up to campus, which involves putting on my healthy person mask to avoid sympathy, pity, and questions, and either a long bus ride or getting a ride from my husband. And sometimes, real people clothes and shoes hurt, which means I’m more comfortable and spending less energy if I can just wear pajamas and socks, instead of jeans and shoes.
Understand that I am not sharing all of this because I want sympathy or pity. I am sharing my experience as a Ph.D. student developing and being diagnosed with a chronic illness because I, unlike many students with any number of other short term or long term disabling conditions, have a lot of support. Because I have a great deal of family support, departmental support, and support from the other Weecologists and our fearless leaders, I should be able to limp through the rest of my Ph.D. If I did not have this support, it is very likely that I would not be able to continue with my dissertation. If I did not have support from ALL of these sources, it is also very likely that I would not be able to continue. While I hope that I will be able to contribute to science with my dissertation, I also think that I can contribute to science by facilitating discussion about some of the problems that chronically ill students face, and hopefully finding solutions to some of those problems. To that end, I have started an open GitHub repository to provide a database of resources that can help students continue their training and would welcome additional contributions. Unfortunately, there doesn’t seem to be a lot. Many medical Leave of Absence programs prevent students from accessing university resources, which frequently include access to subsidized health insurance and potentially the student’s doctor, as well as removing the student from deferred student loans.
I have fibromyalgia. I also have contributions to make to science. While I am, of course, biased, I think that some contribution is better than no contribution. I’d rather be defined by my contributions, rather than my limitations, and I’m glad that my university and my lab aren’t defining me by my limitations, but are rather helping me to make contributions to science to the best of my ability.
The British Ecological Society has announced that it will now allow the submission of papers with preprints (formal language here). This means that you can now submit preprinted papers to Journal of Ecology, Journal of Animal Ecology, Methods in Ecology and Evolution, Journal of Applied Ecology, and Functional Ecology. By allowing preprints BES joins the Ecological Society of America which instituted a pro-preprint policy last year. While BES’s formal policy is still a little more vague than I would like*, they have confirmed via Twitter that even preprints with open licenses are OK as long as they are not updated following peer review.
Preprints are important because they:
- Speed up the progress of science by allowing research to be discussed and built on as soon as it is finished
- Allow early career scientists to establish themselves more rapidly
- Improve the quality of published research by allowing a potentially large pool of reviewers to comment on and improve the manuscript (see our excellent experience with this)
BES getting on board with preprints is particularly great news because the number of ecology journals that do not allow preprints is rapidly shrinking to the point that ecologists will no longer need to consider where they might want to submit their papers when deciding whether or not to post preprints. The only major blocker at this point to my mind is Ecology Letters. So, my thanks to BES for helping move science forward!
*Which is why I waited 3 weeks for clarification before posting.
This is a guest post by Dan McGlinn, a weecology postdoc (@DanMcGlinn on Twitter). It is a Research Summary of: McGlinn, D.J., X. Xiao, and E.P. White. 2013. An empirical evaluation of four variants of a universal species–area relationship. PeerJ 1:e212 http://dx.doi.org/10.7717/peerj.212. These posts are intended to help communicate our research to folks who might not have the time, energy, expertise, or inclination to read the full paper, but who are interested in a <1000 word general language summary.
It is well established in ecology that if the area of a sample is increased you will in general see an increase in the number of species observed. There are a lot of different reasons why larger areas harbor more species: larger areas contain more individuals, habitats, and environmental variation, and they are likely to cross more barriers to dispersal – all things that allow more species to exist together in an area. We typically observe relatively smooth and simple looking increases in species number with area. This observation has mystified ecologists: How can a pattern that should be influenced by many different and biologically idiosyncratic processes appear so similar across scales, taxonomic groups, and ecological systems?
Recently a theory was proposed (Harte et al. 2008, Harte et al. 2009) which suggests that detailed knowledge of the complex processes that influence the increase in species number may not be necessary to accurately predict the pattern. The theory proposes that ecological systems tend to simply be in their most likely configuration. Specifically, the theory suggests that if we have information on the total number of species and individuals in an area then we can predict the number of species in smaller portions of that area.
Published work on this new theory suggests that it has potential for accurately predicting how species number changes with area; however, it has not been appreciated that there are actually four different ways that the theory can be operationalized to make a prediction. We were interested to learn:
- Can the theory accurately predict how species number changes with area across many different ecological systems, and
- Do some versions of the theory consistently perform better than others?
To answer these questions we needed data. We searched online and made requests to our colleagues for datasets that documented the spatial configuration of ecological communities. We were able to pull together a collection of 16 plant community datasets. The communities spanned a wide range of systems including hyper-diverse, old-growth tropical forests, a disturbance prone tropical forest, temperate oak-hickory and pine forests, a Mediterranean mixed-evergreen forest, a low diversity oak woodland, and a serpentine grassland.
Fig 1. A) Results from one of the datasets, the open circles display the observed data and the lines are the four different versions of the theory we examined. B) A comparison of the observed and predicted number of species across all areas and communities we examined for one of the versions of the theory.
Across the different communities we found that the theory was generally quite accurate at predicting the number of species (Fig 1 above), and that one of the versions of the theory was typically better than the others in terms of the accuracy of its predictions and the quantity of information it required to make predictions. There were a couple of noteworthy exceptions in our results. The low diversity oak woodland and the serpentine grassland both displayed unusual patterns of change in richness. The species in the serpentine grassland were more spatially clustered than was typically observed in the other communities and thus better described by the versions of the theory that predicted stronger clustering. Abundance in the oak woodland was primarily distributed across two species whereas the other 5 species were only observed once or twice. This unusual pattern of abundance resulted in a rather unique S-shaped relationship between the number of species and area and required inputting the observed species abundances to accurately model the pattern.
The two key findings from our study were
- The theory provides a practical tool for accurately predicting the number of species in sub-samples of a given site using only information on the total number of species and individuals in that entire area.
- The different versions of the theory do make different predictions and one appears to be superior
Of course there are still a lot of interesting questions to address. One question we are interested in is whether or not we can predict the inputs of the theory (total number of species and individuals for a community) using a statistical model and then plug those predictions into the theory to generate accurate fine-scaled predictions. This kind of application would be important for conservation applications because it would allow scientists to estimate the spatial pattern of rarity and diversity in the community without having to sample it directly. We are also interested in future development of the theory that provides predictions for the number of species at areas that are larger (rather than smaller) than the reference point which may have greater applicability to conservation work.
The accuracy of the theory also has the potential to help us understand the role of specific biological processes in shaping the relationship between species number and area. Because the theory didn’t include any explicit biological processes, our findings suggest that specific processes may only influence the observed relationship indirectly through the total number of species and individuals. Our results do not suggest that biological processes are not shaping the relationship but only that their influence may be rather indirect. This may be welcome news to practitioners who rely on the relationship between species number and area to devise reserve designs and predict the effects of habitat loss on diversity.
Harte, J., A. B. Smith, and D. Storch. 2009. Biodiversity scales from plots to biomes with a universal species-area curve. Ecology Letters 12:789–797.
Harte, J., T. Zillio, E. Conlisk, and A. B. Smith. 2008. Maximum entropy and the state-variable approach to macroecology. Ecology 89:2700–2711.
Doing science in academia involves a lot of rejection and negative feedback. Between grant agencies’ single-digit funding rates, pressure to publish in a few "top" journals all of which have rejection rates of 90% or higher [1], and the growing gulf between the number of academic jobs and the number of graduate students and postdocs [2], spending even a small amount of time in academia pretty much guarantees that you’ll see a lot of rejection. In addition, even when things are going well we tend to focus on providing as much negative feedback as possible. Paper reviews, grant reviews, and most university evaluation and committee meetings are focused on the negatives. Even students with awesome projects that are progressing well and junior faculty who are cruising towards tenure have at least one meeting a year where someone in a position of power will try their best to enumerate all of the things you could be doing better [3]. This isn’t always a bad thing [4] and I’m sure it isn’t restricted to academia or science (these are just the worlds I know), but it does make keeping a positive attitude and reasonable sense of self-worth a bit… challenging.
One of the things that I do to help me remember why I keep doing this is my Why File. It’s a file where I copy and paste reminders of the positive things that happen throughout the year [5]. These typically aren’t the sort of things that end up on my CV. I have my CV for tracking that sort of thing and frankly the number of papers I’ve published and grants I’ve received isn’t really what gets me out of bed in the morning. My Why File contains things like:
- Email from students in my courses, or comments on evaluations, telling me how much of an impact the skills they learned have had on their ability to do science
- Notes from my graduate students, postdocs, and undergraduate researchers thanking me for supporting them, inspiring them, or giving them good advice
- Positive feedback from mentors and people I respect that help remind me that I’m not an impostor
- Tweets from folks reaffirming that an issue or approach I’m advocating for is changing what they do or how they do it
- Pictures of thank you cards or creative things that people in my lab have done
- And even things that in a lot of ways are kind of silly, but that still make me smile, like screen shots of being retweeted by Jimmy Wales or of Tim O’Reilly plugging one of my papers.
If you’ve said something nice to me in the past few years be it in person, by email, on twitter, or in a handwritten note, there’s a good chance that it’s in my Why File helping me keep going at the end of a long week or a long day. And that’s the other key message of this post. We often don’t realize how important it is to say thanks to the folks who are having a positive influence on us from time to time. Or, maybe we feel uncomfortable doing so because we think these folks are so talented and awesome that they don’t need it, or won’t care, or might see this positive feedback as silly or disingenuous. Well, as Julio Betancourt once said, "You can’t hug your reprints", so don’t be afraid to tell a mentor, a student, or a colleague when you think they’re doing a great job. You might just end up in their Why File.
What do you do to help you stay sane in academia, science, or any other job that regularly reminds you of how imperfect you really are?
[1] The idea that where you publish matters more than what you publish is a problem, but not the subject of this post.
[2] There are lots of great ways to use a PhD, but unfortunately not everyone takes that to heart.
[3] Of course the people doing this are (at least sometimes) doing so with the best intentions, but I personally think it would be surprisingly productive to just say, "You’re doing an awesome job. Keep it up." every once in a while.
[4] There is often a goal to the negativity, e.g., helping a paper or person reach their maximum potential, but again I think we tend to undervalue the consequences of this negativity in terms of motivation [4b].
[4b] Hmm, apparently I should write a blog post on this since it now has two footnotes worth of material.
[5] I use a Markdown file, but a simple text file or a MS Word document would work just fine as well for most things.
Academic publishing is in a dynamic state these days with large numbers of new journals popping up on a regular basis. Some of these new journals are actively experimenting with changing traditional approaches to publication and peer review in potentially important ways. So, I thought I’d provide a quick introduction to some of the new kids on the block that I think have the potential to change our approach to academic publishing.
PeerJ
PeerJ is in some ways a fairly standard PLOS One style open access journal. Like PLOS One they only publish primary research (no reviews or opinion pieces) and that research is evaluated only on the quality of the science not on its potential impact. However, what makes PeerJ different (and the reason that I’m volunteering my time as an associate editor for them) is their philosophy that in the era of the modern web it should be both cheap and easy to publish scientific papers:
We aim to drive the costs of publishing down, while improving the overall publishing experience, and providing authors with a publication venue suitable for the 21st Century.
The pricing model is really interesting. Instead of a flat fee per paper, PeerJ uses lifetime author memberships. For $99 (total for life) you can publish 1 paper/year. For $199 you can publish 2 papers/year and for $299 you can publish unlimited papers for life. Every author has to have a membership so for a group of 5 authors publishing in PeerJ for the first time it would cost $495, but that’s still about 1/3 of what you’d pay at PLOS One and 1/6 of what you’d pay to make a paper open access at a Wiley journal. And that same group of authors can publish again next year for free. How they can publish for so much less than anyone else (and whether it is sustainable) is a bit of an open question, but they have clearly spent a lot of time (and serious publishing experience) thinking about how to automate and scale publication in an affordable manner, both technically and in terms of things like typesetting (since single-column text with no attempt to wrap text around tables and figures is presumably much easier to typeset). If you “follow the money” as Brian McGill suggests then the path may well lead you to PeerJ.
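To make the membership arithmetic concrete, here is a small Python sketch using only the prices quoted above. It assumes every author buys the same plan and that the group stays within that plan's yearly paper limit; the function and variable names are made up for the example.

```python
# Lifetime membership prices (US dollars) as quoted in this post.
PLAN_PRICES = {"1 paper/year": 99, "2 papers/year": 199, "unlimited": 299}


def first_paper_cost(n_authors, plan="1 paper/year"):
    """One-time cost for a group where every author buys the same plan."""
    return n_authors * PLAN_PRICES[plan]


def cost_per_paper(n_authors, n_papers, plan="1 paper/year"):
    """Average cost per paper once the same group has published n_papers.

    Assumes the group never exceeds the plan's yearly limit, so no further
    fees are due after the initial memberships.
    """
    return first_paper_cost(n_authors, plan) / n_papers


print(first_paper_cost(5))    # 5 authors x $99 = 495
print(cost_per_paper(5, 3))   # the same 5 authors after 3 papers: 165.0
```

The point of the second function is that the effective price per paper keeps dropping for a stable group of coauthors, since the memberships are paid only once.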
Other cool things about PeerJ:
- Optional open review (authors decide whether reviews are posted with accepted manuscripts, reviewers decide whether to sign reviews)
- Ability to comment on manuscripts with points being given for good comments.
- A focus on making life easy for authors, reviewers, and editors, including a website that is an absolute joy to interact with and a lack of rigid formatting guidelines that have to be satisfied for a paper to be reviewed.
We want authors spending their time doing science, not formatting. We include reference formatting as a guide to make it easier for editors, reviewers, and PrePrint readers, but will not strictly enforce the specific formatting rules as long as the full citation is clear. Styles will be normalized by us if your manuscript is accepted.
Now there’s a definable piece of added value.
Faculty of 1000 Research
Faculty of 1000 Research’s novelty comes from a focus on post-publication peer review. Like PLOS One & PeerJ it reviews based on quality rather than potential impact, and it has a standard per paper pricing model. However, when you submit a paper to F1000 it is immediately posted publicly online, as a preprint of sorts. They then contact reviewers to review the manuscript. Reviews are posted publicly with the reviewers’ names. Each review includes a status designation of “Approved” (similar to Accept or Minor Revisions), “Approved with Reservations” (similar to Major Revisions), or “Not Approved” (similar to Reject). Authors can upload new versions of the paper to satisfy reviewers’ comments (along with a summary/explanation of the changes made), and reviewers can provide new reviews and new ratings. If an article receives two “Approved” ratings or one “Approved” and two “Approved with Reservations” ratings then it is considered accepted. It is then identified on the site as having passed peer review, and is indexed in standard journal databases. The peer review process is also open to anyone, so if you want to write a review of a paper you can, no invite required.
It’s important to note that the individuals who are invited to review the paper are recommended by the authors. They are checked to make sure that they don’t have conflicts of interest and are reasonably qualified before being invited, but there isn’t a significant editorial hand in selecting reviewers. This could be seen as resulting in biased reviews, since one is likely to select reviewers that may be biased towards liking your work. However, this is tempered by the fact that the reviewer’s name and review are publicly attached to the paper, and therefore they are putting their scientific reputation on the line when they support a paper (as argued more extensively by Aarssen & Lortie 2011).
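The acceptance rule is simple enough to write down as a short Python function. This is just my reading of the rule as described above, not F1000's actual system: it assumes the thresholds mean "at least", that only each reviewer's most recent rating is passed in, and that a “Not Approved” rating does not by itself block acceptance. The function name is made up for the example.

```python
from collections import Counter

APPROVED = "Approved"
RESERVATIONS = "Approved with Reservations"
NOT_APPROVED = "Not Approved"


def passes_peer_review(ratings):
    """Apply the rule described in the post to a list of current ratings.

    Accepted if there are at least two "Approved" ratings, or at least one
    "Approved" plus at least two "Approved with Reservations" ratings.
    """
    counts = Counter(ratings)
    approved = counts[APPROVED]
    reservations = counts[RESERVATIONS]
    return approved >= 2 or (approved >= 1 and reservations >= 2)


print(passes_peer_review([APPROVED, APPROVED]))                    # True
print(passes_peer_review([APPROVED, RESERVATIONS, RESERVATIONS]))  # True
print(passes_peer_review([APPROVED, RESERVATIONS]))                # False
```

Because reviewers can update their ratings after a revision, the same function would simply be re-run on the latest set of ratings each time a new version is reviewed.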
In effect, F1000 is modeling a system of exclusively post-publication peer review, with a slight twist of not considering something “published/accepted” until a minimum number of positive reviews are received. This is a bold move since many scientists are not comfortable with this model of peer review, but it has the potential to vastly speed up the rate of scientific communication in the same way that preprints do. So, I for one think this is an experiment worth conducting, which is why I recently reviewed a paper there.
Oh, and ecologists can currently publish there for free (until the end of the year).
Frontiers in X
I have the least personal experience with the Frontiers journals (including the soon-to-launch Frontiers in Ecology & Evolution). Like F1000Research, the groundbreaking nature of Frontiers is in peer review, but instead of moving towards a focus on post-publication peer review they are attempting to change how pre-publication review works. They are trying to make review a more collaborative effort between reviewers and authors to improve the quality of the paper.
As with PeerJ and F1000Research, Frontiers is open access and has a review process that focuses on “the accuracy and validity of articles, not on evaluating their significance”. What makes Frontiers different is their two step review process. The first step appears to be a fairly standard pre-publication peer review, where “review editors” provide independent assessments of the paper. The second step (the “Interactive Review phase”) is where the collaboration comes in. Using an “Interactive Review Forum” the authors and all of the reviewers (and if desirable the associate editor and even the editor in chief for the subdiscipline) work collaboratively to improve the paper to the point that the reviewers support its publication. If disagreements arise the associate editor is tasked with acting as a mediator in the conversation. If a paper is eventually accepted then the reviewers’ names are included with the paper and taken as indicating that they sign off on the quality of the paper (see Aarssen & Lortie 2011 for more discussion of this idea; reviewers can withdraw from the process at any point in which case their names are not included).
I think this is an interesting approach because it attempts to make the review process a friendlier and more interactive process that focuses on quickly converging through conversation on acceptable solutions rather than slow long-form exchanges through multiple rounds of conventional peer review that can often end up focusing as much on judging as improving. While I don’t have any personal experiences with this system I’ve seen a number of associate editors talk very positively about the process at Frontiers.
This post isn’t intended to advocate for any of these particular journals or approaches. These are definitely experimental and we may find that some of them have serious limitations. What I do advocate for is that we conduct these kinds of experiments with academic publishing and support the folks who are taking the lead by developing and test driving these systems to see how they work. To do anything else strikes me as accepting that current academic publishing practices are at their global optimum. That seems fairly unlikely to me, which makes the scientist in me want to explore different approaches so that we can find out how to best evaluate and improve scientific research.
UPDATE: Fixed link to the Faculty of 1000 Research paper that I reviewed. Thanks Jeremy!
UPDATE 2: Added a missing link to Faculty of 1000 Research’s main site.
UPDATE 3: Fixed the missing link to Frontiers in Ecology & Evolution. Apparently I was seriously linking-challenged this morning.
EcoBloggers is a relatively new blog aggregator started by the awesome International Network of Next-Generation Ecologists (INNGE). Blog aggregators pull together posts from a number of related blogs to provide a one stop shop for folks interested in that topic. The most famous example of a blog aggregator in science is probably Research Blogging. I’m a big fan of EcoBloggers for three related reasons.
- It provides easy access to the conversations going on in the ecology blogosphere for folks who don’t have a well organized system for keeping up with blogs. If your only approach to keeping up with blogs is to check them yourself via your browser when you have a few spare minutes (or when you’re procrastinating on writing that next paper or grant) it really helps if you don’t have to remember to check a dozen or more sites, especially since some of those sites won’t post particularly frequently. Just checking EcoBloggers can quickly let you see what everyone’s been talking about over the last few days or weeks. Of course, I would really recommend using a feed reader both for tracking blogs and journal tables of contents, but lots of folks aren’t going to do that and blog aggregators are the next best thing.
- EcoBloggers helps new blogs, blogs with smaller audiences, and those that don’t post frequently, reach the broader community of ecologists. This is important for building a strong ecological blogging community by keeping lots of bloggers engaged and participating in the conversation.
- It helps expose readers to the breadth of conversations happening across ecology. This helps us remember that not everyone thinks like us or is interested in exactly the same things.
The site is also nicely implemented so that it respects the original sources of the content:
- It’s opt-in
- Each post lists the name of the originating blog and the original author
- All links take you to the original source
- It aggregates using RSS feeds, so you can set your site so that only partial articles show up on EcoBloggers (of course this requires you to ignore my advice on providing full feeds)
Are there any downsides to having your blog on EcoBloggers? I don’t think so. The one issue that might be raised is that if someone reads your article on EcoBloggers, then they may not actually visit your site and your stats could end up being lower than they would have otherwise. If any of the ecology blogs were making a lot of money off of advertising I could see this being an issue, but they aren’t. We’re presumably all here to engage in scientific dialogue and to communicate our ideas as broadly as possible. This is only aided by participating in an aggregator because your writing will reach more people than it would otherwise.
So, check out EcoBloggers, use it to keep up with what’s going on in the ecology blogosphere, and sign up your blog today.
UPDATE: According to a short chat on Twitter, EcoBloggers will soon be automatically shortening the posts on their site even if your blog is providing full feeds. This means that if you didn’t buy my arguments above and were worried about losing page views, there’s nothing to worry about. If the first paragraph or so of your post is interesting enough to get people hooked they’ll have to come over to your blog to read the rest.
Dear Ecology Letters and the British Ecological Society,
I am writing to ask that you support the scientific good by allowing the submission of papers that have been posted as preprints. I or my colleagues have reached out to you before without success, but I have heard through various grapevines that both of you are discussing this possibility and I want to encourage you to move forward with allowing this important practice.
The benefits of preprints to science are substantial. They include:
- More rapid communication and discussion of important scientific results
- Improved quality of published research by allowing for more extensive pre-publication peer review
- A fair mechanism for establishing precedence that is not contingent on the idiosyncrasies of formal peer review
- A way for early-career scientists to demonstrate productivity and impact on a time scale that matches their need to apply for postdoctoral fellowships and jobs
I am writing to you specifically because your journals represent the major stumbling block for those of us interested in improving science by posting preprints. Your journals either explicitly do not allow the submission of papers that have preprints posted online or lack explicit statements that it is OK to do so. This means that if there is any possibility of eventually submitting a paper to one of these journals then researchers must avoid posting preprints.
The standard justification that journals give for not allowing preprints is that they constitute “prior publication”. However, this is not an issue for two reasons. First, preprints are not peer reviewed. They are the equivalent of a long established practice in biology of sending manuscripts to colleagues for friendly review and to make them aware of cutting edge work. They simply take advantage of the internet to scale this to larger numbers of colleagues. Second, the vast majority of publication outlets do not believe that preprints represent prior publication, and therefore the publication ethics of the broader field of academic publishing clearly allows this. In particular Science, Nature, PNAS, the Ecological Society of America, the Royal Society, Springer, and Elsevier all generally allow the posting of preprints. Nature even wrote about this policy nearly a decade ago stating that:
Nature never wishes to stand in the way of communication between researchers. We seek rather to add value for authors and the community at large in our peer review, selection and editing… Communication between researchers includes not only conferences but also preprint servers… As first stated in an editorial in 1997, and since then in our Guide to Authors, if scientists wish to display drafts of their research papers on an established preprint server before or during submission to Nature or any Nature journal, that’s fine by us.
If you’d like to learn more about the value of preprints, and see explanations of why some of the other common concerns about preprints are unjustified, some colleagues and I have published a paper on The Case for Open Preprints in Biology.
So, for the good of science, and to bring your journals in line with widely accepted publication practices, I am asking that you please move quickly to explicitly allow the submission of papers that have been posted as preprints.
A friend of mine once joked that doing ecological informatics meant working with data that was big enough that you couldn’t open it in an Excel spreadsheet. At the time (~6 years ago) that meant a little over 64,000 rows in a table. Times have changed a bit since then. We now talk about “big data” instead of “informatics”, Excel can open a table with a little over 1,000,000 rows of data, and most importantly there is an ever increasing amount of publicly available ecological, evolutionary, and environmental data that we can use for tackling ecological questions.
I’ve been into using relatively big data since I entered graduate school in the late 1990s. My dissertation combined analyses of the Breeding Bird Survey of North America (several thousand sites) and assembling hundreds of other databases to understand how patterns varied across ecosystems and taxonomic groups.
One of the reasons that I like using large amounts of data is that it has the potential to give us general answers to ecological questions quickly. The typical development of an ecological idea over the last few decades can generally be characterized as:
- Come up with an idea
- Test it with one or a few populations, communities, etc.
- Publish (a few years ago this would often come even before Step 2)
- In a year or two test it again with a few more populations, communities, etc.
- Either find agreement with the original study or find a difference
- Debate generality vs. specificity
- Lather, rinse, repeat
After a few rounds of this, taking roughly a decade, we gradually started to have a rough idea of whether the initial result was general and if not how it varied among ecosystems, taxonomic groups, regions, etc.
This is fine, and in cases where new data must be generated to address the question this is pretty much what we have to do, but wouldn’t it be better if we could ask and answer the question more definitively with the first paper? This would allow us to make more rapid progress as a science because instead of repeatedly testing and reevaluating the original analysis we would be moving forward and building on the known results. And even if it still takes time to get to this stage, as with meta-analyses that build on decades of individual tests, using all of the available data still provides us with a general answer that is clearer and more (or at least differently) informative than simply reading the results of dozens of similar papers.
So, to put it simply, one of the benefits of using “big data” is to get the most general answer possible to the question of interest.
Now, it’s clear that this idea doesn’t sit well with some folks. Common responses to the use of large datasets (or compilations of small ones) include concerns about the quality of large datasets or the ability of individuals who haven’t collected the data to fully understand it. My impression is that these concerns stem from a tendency to associate “best” with “most precise”. My personal take is that being precise is only half of the problem. If I collect the best dataset imaginable for characterizing pattern/process X, but it only provides me with information on a single taxonomic group at a single site, then, while I can have a lot of confidence in my results, I have no idea whether or not my results apply beyond my particular system. So, precision is great, but so is getting generalizable results, and these two things trade off against one another.
Which leads me to what I increasingly consider to be the ideal scenario for areas of ecological research where some large datasets (either inherently large or assembled from lots of small datasets) can be applied to the question of interest. I think the ideal scenario is a combination of “high quality” and “big” data. By analyzing these two sets of data separately, and determining if the results are consistent we can have the maximum confidence in our understanding of the pattern/process. This is of course not trivial to do. First it requires a clear idea of what is high quality for a particular question and what isn’t. In my experience folks rarely agree on this (which is why I built the Ecological Data Wiki). Second, it further increases the amount of time, effort, and knowledge that goes into the ideal study, and finding the resources to identify and combine these two kinds of data will not be easy. But, if we can do this (and I think I remember seeing it done well in some recent ecological meta-analyses that I can’t seem to find at the moment) then we will have the best possible answer to an ecological question.
- Big data and the future of ecology
- The new bioinformatics: integrating ecological data from the gene to the biosphere
- Statistical machismo (for more on the tradeoffs inherent in being more precise)
As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:
Data:
- Knowledge of SQL
- Dealing with proper database format and structure
- Finding data
- Appropriate treatments of data
- Understanding what good data are
Statistics:
- Monte Carlo methods
- Maximum likelihood methods
- Power analysis
Math:
- Higher level calculus
- Should be able to derive analytical solutions for problems
Programming:
- Should be able to write programs for analysis, not just simple statistics and simple graphs.
- Able to use version control
- Once you can program in one language, you should be able to program in other languages without much effort, but should be fluent in at least one language.
Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas. However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills. Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.
Achieving proficiency in an area is best done by using it for a project that you are interested in. The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.
The attribute should be generalizable to other problems: For example, if you need to learn maximum likelihood for your project, you should understand how to apply it to other questions. If you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
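As a rough illustration of what a generalizable SQL skill can look like, here is a minimal sketch using Python's built-in sqlite3 module. The two databases, their table names, and their column names (bird_counts, tree_stems, site, abundance, plot, dbh) are made up for the example and do not come from any real dataset. The point is only that the same query pattern (group rows and average a value) carries over from one database to another.

```python
import sqlite3


def mean_by_group(conn, table, group_col, value_col):
    """Return {group: mean value} for one table in the given database."""
    # Table and column names cannot be passed as bound parameters,
    # so they are placed in the query text as quoted identifiers.
    # This sketch assumes the names come from the analyst, not from untrusted input.
    query = (
        f'SELECT "{group_col}", AVG("{value_col}") '
        f'FROM "{table}" GROUP BY "{group_col}" ORDER BY "{group_col}"'
    )
    return dict(conn.execute(query).fetchall())


# Hypothetical database 1: bird counts by site
birds = sqlite3.connect(":memory:")
birds.execute("CREATE TABLE bird_counts (site TEXT, abundance INTEGER)")
birds.executemany(
    "INSERT INTO bird_counts VALUES (?, ?)",
    [("A", 10), ("A", 20), ("B", 5)],
)

# Hypothetical database 2: tree stem diameters by plot
trees = sqlite3.connect(":memory:")
trees.execute("CREATE TABLE tree_stems (plot INTEGER, dbh REAL)")
trees.executemany(
    "INSERT INTO tree_stems VALUES (?, ?)",
    [(1, 30.0), (1, 50.0), (2, 12.0)],
)

# The same query pattern applied to two differently structured databases
bird_means = mean_by_group(birds, "bird_counts", "site", "abundance")
tree_means = mean_by_group(trees, "tree_stems", "plot", "dbh")
print(bird_means)
print(tree_means)
```

Once the query is understood as a pattern rather than a one-off command, moving to a new database is mostly a matter of finding the right table and column names.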
In graduate school:
Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data; even if they cannot write SQL code, they need to understand what good and bad databases look like and develop a good sense for questionable data, and how known issues with data could affect the appropriateness of data for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.
A student who is getting a terminal master’s and is planning on using pre-existing data should probably be focusing on the data skill (because working with data is a highly marketable skill, and understanding data prevents major mistakes). If the data are not coming from a central database, like the BBS, where the quality of the data is known, additional time will have to be budgeted to compile the data, clean the data, figure out if the data can be used responsibly, and fill holes in the data.
Master’s students who want to go on for a Ph.D. should decide what questions they are interested in and should try to pick a project that focuses on learning a good skill that will give them a head start: more empirical (programming or stats), more theoretical (math), or more applied (math, e.g., for developing models; stats, e.g., applying pre-existing models and evaluating models; or programming, e.g., making tools for people to use).
Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions. Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill. Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.
Right now, as a Ph.D. student, I am specializing in data and programming. I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project. For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.