Slides and script from Ethan White’s Ignite talk on Big Data in Ecology from Sandra Chung and Jacquelyn Gill‘s excellent ESA 2013 session on Sharing Makes Science Better. Slides are also archived on figshare.
1. I’m here to talk to you about the use of big data in ecology and to help motivate a lot of the great tools and approaches that other folks will talk about later in the session.
2. The definition of big is of course relative, and so when we talk about big data in ecology we typically mean big relative to our standard approaches based on observations and experiments conducted by single investigators or small teams.
3. And for those of you who prefer a more precise definition, my friend Michael Weiser defines big data and ecoinformatics as involving anything that can’t be successfully opened in Microsoft Excel.
4. Data can be of unusually large size in two ways. It can be inherently large, like citizen science efforts such as Breeding Bird Survey, where large amounts of data are collected in a consistent manner.
5. Or it can be large because it’s composed of a large number of small datasets that are compiled from sources like Dryad, figshare, and Ecological Archives to form useful compilation datasets for analysis.
6. We have increasing amounts of both kinds of data in ecology as a result of both major data collection efforts and an increased emphasis on sharing data.
7-8. But what does this kind of data buy us. First, big data allows us to work at scales beyond those at which traditional approaches are typically feasible. This is critical because many of the most pressing issues in ecology including climate change, biodiversity, and invasive species operate at broad spatial and long temporal scales.
9-10. Second, big data allows us to answer questions in general ways, so that we get the answer today instead of waiting a decade to gradually compile enough results to reach concensus. We can do this by testing theories using large amounts of data from across ecosystems and taxonomic groups, so that we know that our results are general, and not specific to a single system (e.g., White et al. 2012).
11. This is the promise of big data in ecology, but realizing this potential is difficult because working with either truly big data or data compilations is inherently challenging, and we still lack sufficient data to answer many important questions.
12. This means that if we are going to take full advantage of big data in ecology we need 3 things. Training in computational methods for ecologists, tools to make it easier to work with existing data, and more data.
13. We need to train ecologists in the computational tools needed for working with big data, and there are an increasing number of efforts to do this including Software Carpentry (which I’m actively involved in) as well as training initiatives at many of the data and synthesis centers.
14. We need systems for storing, distributing, and searching data like DataONE, Dryad, NEON‘s data portal, as well as the standardized metadata and associated tools that make finding data to answer a particular research question easier.
15. We need crowd-sourced systems like the Ecological Data Wiki to allow us to work together on improving insufficient metadata and understanding what kinds of analyses are appropriate for different datasets and how to conduct them rigorously.
16. We need tools for quickly and easily accessing data like rOpenSci and the EcoData Retriever so that we can spend our time thinking and analyzing data rather than figuring out how to access it and restructure it.
17. We also need systems that help turn small data into big data compilations, whether it be through centralized standardized databases like GBIF or tools that pull data together from disparate sources like Map of Life.
18. And finally we we need to continue to share more and more data and share it in useful ways. With the good formats, standardized metadata, and open licenses that make it easy to work with.
19. And so, what I would like to leave you with is that we live in an exciting time in ecology thanks to the generation of large amounts of data by citizen science projects, exciting federal efforts like NEON, and a shift in scientific culture towards sharing data openly.
20. If we can train ecologists to work with and combine existing tools in interesting ways, it will let us combine datasets spanning the surface of the globe and diversity of life to make meaningful predictions about ecological systems.