Going big with data in ecology
A friend of mine once joked that doing ecological informatics meant working with data that was big enough that you couldn’t open it in an Excel spreadsheet. At the time (~6 years ago) that meant a little over 64,000 rows in a table. Times have changed a bit since then. We now talk about “big data” instead of “informatics”, Excel can open a table with a little over 1,000,000 rows of data, and most importantly there is an ever-increasing amount of publicly available ecological, evolutionary, and environmental data that we can use for tackling ecological questions.
I’ve been into using relatively big data since I entered graduate school in the late 1990s. My dissertation combined analyses of the Breeding Bird Survey of North America (several thousand sites) with hundreds of other databases that I assembled to understand how patterns varied across ecosystems and taxonomic groups.
One of the reasons that I like using large amounts of data is that it has the potential to give us general answers to ecological questions quickly. The typical development of an ecological idea over the last few decades can generally be characterized as:
- Come up with an idea
- Test it with one or a few populations, communities, etc.
- Publish (a few years ago this would often come even before Step 2)
- In a year or two test it again with a few more populations, communities, etc.
- Either find agreement with the original study or find a difference
- Debate generality vs. specificity
- Lather, rinse, repeat
After a few rounds of this, taking roughly a decade, we gradually started to have a rough idea of whether the initial result was general and, if not, how it varied among ecosystems, taxonomic groups, regions, etc.
This is fine, and in cases where new data must be generated to address the question this is pretty much what we have to do, but wouldn’t it be better if we could ask and answer the question more definitively with the first paper? This would allow us to make more rapid progress as a science because instead of repeatedly testing and reevaluating the original analysis we would be moving forward and building on the known results. And even if it still takes time to get to this stage, as with meta-analyses that build on decades of individual tests, using all of the available data still provides us with a general answer that is clearer and more (or at least differently) informative than simply reading the results of dozens of similar papers.
So, to put it simply, one of the benefits of using “big data” is to get the most general answer possible to the question of interest.
Now, it’s clear that this idea doesn’t sit well with some folks. Common responses to the use of large datasets (or compilations of small ones) include concerns about the quality of large datasets or the ability of individuals who haven’t collected the data to fully understand it. My impression is that these concerns stem from a tendency to associate “best” with “most precise”. My personal take is that being precise is only half of the problem. If I collect the best dataset imaginable for characterizing pattern/process X, but it only provides me with information on a single taxonomic group at a single site, then, while I can have a lot of confidence in my results, I have no idea whether or not my results apply beyond my particular system. So, precision is great, but so is getting generalizable results, and these two things trade off against one another.
Which leads me to what I increasingly consider to be the ideal scenario for areas of ecological research where some large datasets (either inherently large or assembled from lots of small datasets) can be applied to the question of interest. I think the ideal scenario is a combination of “high quality” and “big” data. By analyzing these two sets of data separately and determining whether the results are consistent, we can have the maximum confidence in our understanding of the pattern/process. This is of course not trivial to do. First, it requires a clear idea of what is high quality for a particular question and what isn’t. In my experience folks rarely agree on this (which is why I built the Ecological Data Wiki). Second, it further increases the amount of time, effort, and knowledge that goes into the ideal study, and finding the resources to identify and combine these two kinds of data will not be easy. But if we can do this (and I think I remember seeing it done well in some recent ecological meta-analyses that I can’t seem to find at the moment), then we will have the best possible answer to an ecological question.
- Big data and the future of ecology
- The new bioinformatics: integrating ecological data from the gene to the biosphere
- Statistical machismo (for more on the tradeoffs inherent in being more precise)