Jabberwocky Ecology

Going big with data in ecology

A friend of mine once joked that doing ecological informatics meant working with data that was big enough that you couldn’t open it in an Excel spreadsheet. At the time (~6 years ago) that meant a little over 64,000 rows in a table. Times have changed a bit since then: we now talk about “big data” instead of “informatics”, Excel can open a table with a little over 1,000,000 rows of data, and, most importantly, there is an ever increasing amount of publicly available ecological, evolutionary, and environmental data that we can use for tackling ecological questions.

I’ve been into using relatively big data since I entered graduate school in the late 1990s. My dissertation combined analyses of the Breeding Bird Survey of North America (several thousand sites) with hundreds of other databases that I assembled to understand how patterns varied across ecosystems and taxonomic groups.

One of the reasons that I like using large amounts of data is that it has the potential to give us general answers to ecological questions quickly. The typical development of an ecological idea over the last few decades can generally be characterized as:

  1. Come up with an idea
  2. Test it with one or a few populations, communities, etc.
  3. Publish (a few years ago this would often come even before Step 2)
  4. In a year or two test it again with a few more populations, communities, etc.
  5. Either find agreement with the original study or find a difference
  6. Debate generality vs. specificity
  7. Lather, rinse, repeat

After a few rounds of this, taking roughly a decade, we gradually started to have a rough idea of whether the initial result was general and if not how it varied among ecosystems, taxonomic groups, regions, etc.

This is fine, and in cases where new data must be generated to address the question it is pretty much what we have to do, but wouldn’t it be better if we could ask and answer the question more definitively with the first paper? This would allow us to make more rapid progress as a science because instead of repeatedly testing and reevaluating the original analysis we would be moving forward and building on known results. And even if it still takes time to get to this stage, as with meta-analyses that build on decades of individual tests, using all of the available data still provides us with a general answer that is clearer and more (or at least differently) informative than simply reading the results of dozens of similar papers.

So, to put it simply, one of the benefits of using “big data” is to get the most general answer possible to the question of interest.

Now, it’s clear that this idea doesn’t sit well with some folks. Common responses to the use of large datasets (or compilations of small ones) include concerns about the quality of large datasets or the ability of individuals who haven’t collected the data to fully understand it. My impression is that these concerns stem from a tendency to associate “best” with “most precise”. My personal take is that being precise is only half of the problem. If I collect the best dataset imaginable for characterizing pattern/process X, but it only provides me with information on a single taxonomic group at a single site, then, while I can have a lot of confidence in my results, I have no idea whether or not my results apply beyond my particular system. So, precision is great, but so is getting generalizable results, and these two things trade off against one another.

Which leads me to what I increasingly consider to be the ideal scenario for areas of ecological research where some large datasets (either inherently large or assembled from lots of small datasets) can be applied to the question of interest. I think the ideal scenario is a combination of “high quality” and “big” data. By analyzing these two sets of data separately and determining whether the results are consistent, we can have the maximum confidence in our understanding of the pattern/process. This is of course not trivial to do. First, it requires a clear idea of what counts as high quality for a particular question and what doesn’t. In my experience folks rarely agree on this (which is why I built the Ecological Data Wiki). Second, it further increases the amount of time, effort, and knowledge that goes into the ideal study, and finding the resources to identify and combine these two kinds of data will not be easy. But if we can do this (and I think I remember seeing it done well in some recent ecological meta-analyses that I can’t seem to find at the moment), then we will have the best possible answer to an ecological question.

Four basic skill areas for a macroecologist [Guest post]

This is a guest post by Elita Baldridge (@elitabaldridge), a graduate student in Ethan White’s lab in the Ecology Center at Utah State University.

As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:

  • Data
  • Statistics
  • Math
  • Programming

Data:

  • Knowledge of SQL
  • Dealing with proper database format and structure
  • Finding data
  • Appropriate treatments of data
  • Understanding what good data are

Statistics:

  • Bayesian
  • Monte Carlo methods
  • Maximum likelihood methods
  • Power analysis
  • etc.

Math:

  • Higher level calculus
  • Should be able to derive analytical solutions for problems
  • Modelling

Programming:

  • Should be able to write programs for analysis, not just simple statistics and simple graphs.
  • Able to use version control
  • Once you can program in one language, you should be able to program in other languages without much effort, but should be fluent in at least one language.

General recommendations:

Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas.  However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills.  Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.

Gaining skills:

Achieving proficiency in an area is best done by using it for a project that you are interested in.  The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.

The attribute should be generalizable to other problems:  For example, if you need to learn maximum likelihood for your project, you should understand how to apply it to other questions.  If you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
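
As a concrete illustration of that kind of transfer, here is a toy query using Python’s built-in sqlite3 module. The table and column names are entirely made up for the example; the point is that once you understand the query pattern, only the names change when you move to a different database:

    import sqlite3

    # A toy in-memory database standing in for something like the BBS;
    # the table and column names here are hypothetical.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE counts (site_id TEXT, species_id TEXT, abundance INTEGER)")
    db.executemany("INSERT INTO counts VALUES (?, ?, ?)",
                   [("A1", "PEMA", 14), ("A1", "DIME", 3), ("B2", "PEMA", 9)])

    # Total abundance of each species at one site
    query = """SELECT species_id, SUM(abundance)
               FROM counts
               WHERE site_id = ?
               GROUP BY species_id"""
    print(db.execute(query, ("A1",)).fetchall())  # e.g. [('DIME', 3), ('PEMA', 14)]

    # The same SELECT/WHERE/GROUP BY pattern works on any other database
    # once you swap in its table and column names.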

In graduate school:

Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data; even if they cannot write SQL code, they need to understand what good and bad databases look like, develop a good sense for questionable data, and understand how known issues with data could affect its appropriateness for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.

A student who is getting a terminal master’s and is planning on using pre-existing data should probably focus on the data skill (because data is a highly marketable skill, and understanding data prevents major mistakes).  If the data are not coming from a central database, like the BBS, where the quality of the data is known, additional time will have to be budgeted to compile the data, clean it, figure out whether it can be used responsibly, and fill holes in it.

Master’s students who want to go on for a Ph.D. should decide what questions they are interested in and should try to pick a project that focuses on learning a good skill that will give them a head start: more empirical (programming or stats), more theoretical (math), or more applied (math for developing models, stats for applying and evaluating pre-existing models, or programming for making tools for people to use).

Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions.  Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill.  Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.

Right now, as a Ph.D. student, I am specializing in data and programming.  I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project.  For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.

Open talks and posters from Weecology at #ESA2013

We had a great time at ESA this year and enjoyed getting to interact with lots of both old and new friends and colleagues. Since we’re pretty into open science here at Weecology, it’s probably not surprising that we have a lot of slides (and even scripts) from our many and varied talks and posters posted online, and we thought it might be helpful to aggregate them all in one place. Enjoy.

Thanks to Dan McGlinn for help assembling the links.

Ignite Talk: Big Data in Ecology

Slides and script from Ethan White’s Ignite talk on Big Data in Ecology from Sandra Chung and Jacquelyn Gill’s excellent ESA 2013 session on Sharing Makes Science Better. Slides are also archived on figshare.

Title slide

1.  I’m here to talk to you about the use of big data in ecology and to help motivate a lot of the great tools and approaches that other folks will talk about later in the session.

Photos of field work

2.  The definition of big is of course relative, and so when we talk about big data in ecology we typically mean big relative to our standard approaches based on observations and experiments conducted by single investigators or small teams.

Image of Microsoft Excel

3.  And for those of you who prefer a more precise definition, my friend Michael Weiser defines big data and ecoinformatics as involving anything that can’t be successfully opened in Microsoft Excel.

Map of Breeding Bird Survey

4.  Data can be of unusually large size in two ways. It can be inherently large, like citizen science efforts such as Breeding Bird Survey, where large amounts of data are collected in a consistent manner.

Images of Dryad, figshare, and Ecological Archives

5.  Or it can be large because it’s composed of a large number of small datasets that are compiled from sources like Dryad, figshare, and Ecological Archives to form useful compilation datasets for analysis.

Dataset logos

6.  We have increasing amounts of both kinds of data in ecology as a result of both major data collection efforts and an increased emphasis on sharing data.

Maps and quote about large scale ecology from NEON

7-8.  But what does this kind of data buy us? First, big data allows us to work at scales beyond those at which traditional approaches are typically feasible. This is critical because many of the most pressing issues in ecology, including climate change, biodiversity, and invasive species, operate at broad spatial and long temporal scales.

Map and results of general analysis

9-10.  Second, big data allows us to answer questions in general ways, so that we get the answer today instead of waiting a decade to gradually compile enough results to reach consensus. We can do this by testing theories using large amounts of data from across ecosystems and taxonomic groups, so that we know that our results are general, and not specific to a single system (e.g., White et al. 2012).

The most interesting man in the world says: I don't always analyze data, but when I do, I prefer a lot of it

11. This is the promise of big data in ecology, but realizing this potential is difficult because working with either truly big data or data compilations is inherently challenging, and we still lack sufficient data to answer many important questions.

Bullet points: 1. Training, 2. Tools, 3. More data.

12. This means that if we are going to take full advantage of big data in ecology we need three things: training in computational methods for ecologists, tools to make it easier to work with existing data, and more data.

Logos of groups running training initiatives

13. We need to train ecologists in the computational tools needed for working with big data, and there are an increasing number of efforts to do this including Software Carpentry (which I’m actively involved in) as well as training initiatives at many of the data and synthesis centers.

Logos for DataONE, Dryad, NEON, Morpho, and DataUP

14. We need systems for storing, distributing, and searching data like DataONE, Dryad, and NEON’s data portal, as well as the standardized metadata and associated tools that make finding data to answer a particular research question easier.

Screenshot of Ecological Data Wiki

15. We need crowd-sourced systems like the Ecological Data Wiki to allow us to work together on improving insufficient metadata and understanding what kinds of analyses are appropriate for different datasets and how to conduct them rigorously.

rOpenSci and EcoData Retriever logos

16. We need tools for quickly and easily accessing data like rOpenSci and the EcoData Retriever so that we can spend our time thinking and analyzing data rather than figuring out how to access it and restructure it.

Map of Life, GBIF, and EcoData Retriever logos

17. We also need systems that help turn small data into big data compilations, whether it be through centralized standardized databases like GBIF or tools that pull data together from disparate sources like Map of Life.

Screen shot of preprint, and Morpho, DataUP, and CC0 logos

18. And finally, we need to continue to share more and more data, and to share it in useful ways: with good formats, standardized metadata, and open licenses that make it easy to work with.

Dataset logos

19. And so, what I would like to leave you with is that we live in an exciting time in ecology thanks to the generation of large amounts of data by citizen science projects, exciting federal efforts like NEON, and a shift in scientific culture towards sharing data openly.

River Ernest-White saying "Aw Dad, Big Data is such a buzz word"

20. If we can train ecologists to work with and combine existing tools in interesting ways, it will let us combine datasets spanning the surface of the globe and diversity of life to make meaningful predictions about ecological systems.

[Preprint] Nine simple ways to make it easier to (re)use your data

I’m a big fan of preprints, the posting of papers in public archives prior to peer review. Preprints speed up the scientific dialogue by letting everyone see research as it happens, not 6 months to 2 years later following the sometimes extensive peer review process. They also allow more extensive pre-publication peer review because input can be solicited from the entire community of scientists, not just two or three individuals. You can read more about the value of preprints in our preprint about preprints (yes, really) posted on figshare.

In the spirit of using preprints to facilitate broad pre-publication peer review, a group of weecologists have just posted a preprint on how to make it easier to reuse data that is shared publicly. Since PeerJ’s commenting system isn’t live yet, we would like to encourage you to provide feedback about the paper here in the comments. It’s for a special section of Ideas in Ecology and Evolution on data sharing (something else I’m a big fan of) that is being organized by Karthik Ram (someone I’m a big fan of).

Our nine recommendations are:

  1. Share your data
  2. Provide metadata
  3. Provide an unprocessed form of the data
  4. Use standard data formats (including file formats, table structures, and cell contents; see the sketch after this list)
  5. Use good null values
  6. Make it easy to combine your data with other datasets
  7. Perform basic quality control
  8. Use an established repository
  9. Use an established and liberal license
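
To make recommendations 4-6 concrete, here is a minimal sketch of the kind of table we have in mind: long format so it combines easily with other data, an explicit and consistent null value, and a plain file format. The species codes and column names are invented for the illustration, not taken from the paper:

    import pandas as pd

    # One observation per row (long format) combines easily with other datasets
    counts = pd.DataFrame({
        "site_id": ["A1", "A1", "B2", "B2"],
        "year": [2012, 2012, 2012, 2012],
        "species": ["PEMA", "DIME", "PEMA", "CHPE"],
        "abundance": [14, 3, None, 7],  # a true null, not a sentinel like -999
    })

    # A plain CSV file with a single, consistent null string
    counts.to_csv("mammal_counts.csv", index=False, na_rep="NA")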

Most of this territory has been covered before by a number of folks in the data sharing world, but if you look at the state of most ecological and evolutionary data it clearly bears repeating. In addition, I think that our unique contribution is threefold: 1) we’ve tried hard to stick to relatively simple things that don’t require a huge time commitment to get right; 2) we’ve tried to minimize the jargon and really communicate with the awesome folks who are collecting great data but don’t have much formal background in the best practices of structuring and sharing data; and 3) we contribute the perspective of folks who spend a lot of time working with other people’s data and have therefore encountered many of the most common issues that crop up in ecological and evolutionary data.

So, if you have the time, energy, and inclination, please read the preprint and let us know what you think and what we can do to improve the paper in the comments section.

UPDATE: This manuscript was written in the open on GitHub. You can also feel free to file GitHub issues if that’s more your style.

UPDATE 2: PeerJ has now enabled commenting on preprints, so comments are welcome directly on our preprint as well (https://peerj.com/preprints/7/).

Characterizing the species-abundance distribution with only information on richness and total abundance [Research Summary]

This is the first of a new category of posts here at Jabberwocky Ecology called Research Summaries. We like the idea of communicating our research more broadly than to the small number of folks who have the time, energy, and interest to read through entire papers. So, for every paper that we publish we will (hopefully) also do a blog post communicating the basic idea in a manner targeted towards a more general audience. As a result these posts will intentionally skip over a lot of detail (technical and otherwise), and will intentionally use language that is less precise, in order to communicate more broadly. We suspect that it will take us quite a while to figure out how to do this well. Feedback is certainly welcome.

This is a Research Summary of: White, E.P., K.M. Thibault, and X. Xiao. 2012. Characterizing species-abundance distributions across taxa and ecosystems using a simple maximum entropy model. Ecology. http://dx.doi.org/10.1890/11-2177.1*

The species-abundance distribution describes the number of species with different numbers of individuals. It is well known that within an ecological community most species are relatively rare and only a few species are common, and understanding the detailed form of this distribution of individuals among species has been of interest in ecology for decades. This distribution is considered interesting both because it is a complete characterization of the commonness and rarity of species and because the distribution can be used to test and parameterize ecological models.

Numerous mathematical descriptions of this distribution have been proposed and much of the research into this pattern has focused on trying to figure out which of these descriptions is “the best” for a particular group of species at a small number of sites. We took an alternative approach to this pattern and asked: Can we explain broad scale, cross-taxonomic patterns in the general shape of the abundance distribution using a simple model that requires only knowledge of the species richness and total abundance (summed across all species) at a site?

To do this we used a model that basically describes the most likely form of the distribution if the average number of individuals per species is fixed (which turns out to be a slightly modified version of the classic log-series distribution; see the paper or John Harte’s new book for details). As a result this model involves no detailed biological processes, and if we know richness and total abundance we can predict the abundance of each species in the community (i.e., the abundance of the most common species, second most common species… rarest species).
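
For the quantitatively inclined, here is a rough sketch in Python of the kind of calculation this involves. It assumes the upper-truncated log-series form described in Harte’s book; the function name, the quantile-based ranking, and the numerical details are our illustration, not the code we actually used for the paper:

    import numpy as np
    from scipy.optimize import brentq

    def predicted_rank_abundances(S, N):
        """Predict the abundance of every species, most common first,
        for a community with S species and N total individuals."""
        n = np.arange(1, N + 1)

        def mean_constraint(beta):
            # Truncated log-series weights exp(-beta*n)/n; the constraint
            # pins the mean abundance at N/S.
            w = np.exp(-beta * n) / n
            return np.sum(n * w) / np.sum(w) - N / S

        beta = brentq(mean_constraint, 1e-8, 5.0)  # bracket fine for N/S >> 1
        w = np.exp(-beta * n) / n
        cdf = np.cumsum(w) / np.sum(w)

        # The i-th rarest species is assigned the ((i - 0.5)/S)-quantile abundance
        quantiles = (np.arange(1, S + 1) - 0.5) / S
        abundances = n[np.searchsorted(cdf, quantiles)]
        return abundances[::-1]

    print(predicted_rank_abundances(S=10, N=500))

From nothing but S and N this returns a predicted abundance for every species, most common first, which can then be compared rank-for-rank against the observed community.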

Since we wanted to know how well this works in general (not how well it works for birds in Utah or trees in Panama) we put together a dataset of more than 15,000 communities. We did this by combining 6 major datasets that are either citizen science efforts, big government efforts, or compilations from the literature. This compilation includes data on birds, trees, mammals, and butterflies. So, while we’re missing the microbes and aquatic species, I think that we can be pretty confident that we have an idea of the general pattern.

In general, we can do an excellent job of predicting the abundance of each rank of species (most abundant, second most abundant…) at each site using only information on the species richness and total abundance at the site. Here is a plot of the observed number of individuals in a given rank at a given site against the number predicted. The plot is for Breeding Bird Survey data, but the rest of the datasets produce similar results.

Observed-predicted plot for nearly 3000 Breeding Bird Survey communities. Since there are over 100,000 points on this plot we’ve color coded them by the number of points in the vicinity of the focal point, so red areas have lots of points nearby and blue areas have very few points. The black line is the 1:1 line.
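
For anyone curious how figures like this are put together, here is one common recipe, sketched with made-up data rather than our actual plotting code: evaluate the local density of points with a kernel density estimate and use it to color the scatter.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    def obs_pred_plot(obs, pred):
        """Observed vs. predicted abundances on log axes, with each point
        colored by the density of points around it (red = crowded)."""
        x, y = np.log10(pred), np.log10(obs)
        density = gaussian_kde(np.vstack([x, y]))(np.vstack([x, y]))
        order = density.argsort()  # draw the densest points on top
        plt.scatter(x[order], y[order], c=density[order], cmap="jet", s=5)
        lims = [min(x.min(), y.min()), max(x.max(), y.max())]
        plt.plot(lims, lims, "k-")  # the 1:1 line
        plt.xlabel("log10(predicted abundance)")
        plt.ylabel("log10(observed abundance)")
        plt.show()

    # Toy data: predictions plus multiplicative noise
    rng = np.random.default_rng(1)
    pred = rng.integers(1, 1000, size=2000)
    obs = np.maximum(1, np.round(pred * rng.lognormal(0, 0.3, size=2000)))
    obs_pred_plot(obs, pred)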

The model isn’t perfect of course (they never are, and we highlight some of its failures in the paper), but if we know the richness and total abundance of a site then we can capture over 90% of the variation in the form of the species-abundance distribution across ecosystems and taxonomic groups.

This result is interesting for two reasons:

First, it suggests that the species-abundance distribution, on its own, doesn’t tell us much about the detailed biological processes structuring a community. Ecologists have known for a while that it wasn’t fully sufficient for distinguishing between different models (though we didn’t always act like it), but our results suggest that in fact there is very little additional information in the distribution beyond knowing the species richness and total abundance. As such, any model that yields reasonable richness and total abundance values will probably produce a reasonable species-abundance distribution.

Second, this means that we can potentially predict the full distribution of commonness and rarity even at locations we have never visited. This is possible because richness and total abundance can, at least sometimes, be well predicted using remotely sensed data. These predictions could then be combined with this model of the species-abundance distribution to make predictions for things like the number of rare species at a site. In general, we’re interested in figuring out how much ecological pattern and process can be effectively characterized and predicted at large spatial scales, and this research helps expand that ability.

So, that’s the end of our first Research Summary. I hope it’s a useful thing that folks get something out of. In addition to the science in this paper, I’m also really excited about the process that we used to accomplish this research and to make it as reproducible as possible. So, stay tuned for some follow up posts on big data in ecology, collaborative code development, and making ecological research more reproducible.

———————————————————————————————————————————————————————————————
*The paper will be Open Access once it is officially published but, for reasons that don’t make a lot of sense to me, it is behind a paywall until it comes out in print.

A new database for mammalian community ecology and macroecology

There are a number of great datasets available for doing macroecology and community ecology at broad spatial scales. These include data on birds (Breeding Bird Survey, Christmas Bird Count), plants (Forest Inventory & Analysis, Gentry’s transects), and insects (North American Butterfly Association Counts). However, if you wanted to do work that relied on knowing the presence or abundance of individuals at particular sites (i.e., you’re looking for something other than range maps), there has never been a decent dataset to work with for mammals.

Announcing the Mammal Community Database (MCDB)

Over the past couple of years we’ve been working to fill that gap as best we could. Since coordinated continental-scale surveys of mammals don’t yet exist [1], we dug into the extensive mammalogy literature and compiled a database of 1000 globally distributed communities. Thanks to Kate Thibault’s leadership and the hard work of Sarah Supp and Mikaelle Giffen, we are happy to announce that this data is now freely available as a data paper on Ecological Archives.

In addition to containing species lists for 1000 locales, there is abundance data for 940 of the locations, some site-level body size data (~50 sites), and a handful of reasonably long (>10 yr) time series as well. Most of the data is restricted to the particular mode of sampling that an individual mammalogist uses, and as a result much of it is for small mammals captured in Sherman traps.

Working with data compilations like this is always difficult because the differences in sampling intensity and approaches between studies can make it very difficult to compare data across sites. We’ve put together a detailed table of information on how sampling was conducted to help folks break the data into comparable subsets and/or attempt to control for the influence of sampling differences in their statistical models.
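
As a sketch of what we mean (using toy stand-ins here; the actual table and column names are documented in the data paper and may differ from these), pulling out a comparable subset might look something like this in pandas:

    import pandas as pd

    # Toy stand-ins for the MCDB site and community tables
    sites = pd.DataFrame({
        "site_id": [1, 2, 3],
        "trap_type": ["Sherman", "pitfall", "Sherman"],
        "trap_nights": [1200, 300, 900],
    })
    communities = pd.DataFrame({
        "site_id": [1, 1, 2, 3],
        "species": ["PEMA", "DIME", "SOCI", "PEMA"],
        "abundance": [14, 3, 7, 9],
    })

    # Restrict analyses to Sherman-trapped sites with broadly similar effort
    keep = sites[(sites.trap_type == "Sherman") & (sites.trap_nights >= 500)]
    subset = communities.merge(keep[["site_id"]], on="site_id")
    print(subset)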

The joys of Open Science

We’ve been gradually working on making the science that we do at Weecology more and more open, and the MCDB is an example of that. We submitted the database to Ecological Archives before we had actually done much of anything with it ourselves [2], because the main point of collecting the data was to provide a broadly useful resource to the ecological community, not to answer a specific question. We were really excited to see that as soon as we announced it on Twitter, folks started picking it up and doing cool things with it [3]. We hope that folks will find all sorts of uses for it going forward.

Going forward

We know that there is tons more data out there on mammal communities. Some of it is unpublished, or not published in enough detail for us to include. Some of it has licenses that mean that we can’t add it to the MCDB without special permission (e.g., there is a lot of great LTER mammal data out there). Lots of it we just didn’t find while searching through the literature.

If folks know of more data we’d love to hear about it. If you can give us permission to add data that has more restrictive licensing then we’d love to do so [4]. If you’re interested in collaborating on growing the database let us know. If there’s enough interest we can invest some time in developing a public portal.

The footnotes [5]

[1] We are anxiously awaiting NEON’s upcoming surveys, headed up by former Weecology postdoc Kate Thibault.

[2] We have a single paper that is currently in review that uses the data.

[3] Thanks to Scott Chamberlain and Markus Gesmann. You guys are awesome!

[4] To be clear, we haven’t been asking for permission yet, so no one has turned us down. We wanted to get the first round of data collection done first to show that this was a serious effort.

[5] Because anything that David Foster Wallace loved has to be a good thing.

Postdoc in Evolutionary Bioinformatics [Jobs]

There is an exciting postdoc opportunity for folks interested in quantitative approaches to studying evolution in Michael Gilchrist’s lab at the University of Tennessee. I knew Mike when we were both in New Mexico. He’s really sharp, a nice guy, and a very patient teacher. He taught me all about likelihood and numerical maximization and opened my mind to a whole new way of modeling biological systems. This will definitely be a great postdoc for the right person, especially since NIMBioS is at UTK as well. Here’s the ad:

Outstanding, motivated candidates are being sought for a post-doctoral position in the Gilchrist lab in the Department of Ecology & Evolutionary Biology at the University of Tennessee, Knoxville. The successful candidate will be supported by a three-year NSF grant whose goal is to develop, integrate, and test mathematical models of protein translation and sequence evolution using available genomic sequence and expression level datasets. Publications directly related to this work include Gilchrist, M.A. 2007, Molec. Bio. & Evol. (http://www.tinyurl/shahgilchrist11) and Shah, P. and M.A. Gilchrist 2011, PNAS (http://www.tinyurl/gilchrist07a).

The emphasis of the laboratory is focused on using biologically motivated models to analyze complex, heterogeneous datasets to answer biologically motivated questions. The research associated with this position draws upon a wide range of scientific disciplines including: cellular biology, evolutionary theory, statistical physics, protein folding, differential equations, and probability. Consequently, the ideal candidate would have a Ph.D. in either biology, mathematics, physics, computer science, engineering, or statistics with a background and interest in at least one of the other areas.

The researcher will collaborate closely with the PIs (Drs. Michael Gilchrist and Russell Zaretzki) on this project but will potentially have time to collaborate on other research projects with the PIs. In addition, the researcher will have opportunities to interact with other faculty members in the Division of Biology as well as researchers at the National Institute for Mathematical and Biological Synthesis (http://www.nimbios.org).

Review of applications begins immediately and will continue until the position is filled. To apply, please submit curriculum vitae including three references, a brief statement of research background and interests, and 1-3 relevant manuscripts to mikeg[at]utk[dot]edu.

Some meandering thoughts on the difference between EcologicalData.org and DataONE

In the comments of my post on the Ecological Data Wiki Jarrett Byrnes asked an excellent question:

Very cool. I’m curious, how do you think this will compare/contrast/fight with the Data One project – https://www.dataone.org/ – or is this a different beast altogether?

As I started to answer it I realized that my thoughts on the matter were better served by a full post, both because they are a bit lengthy and because I don’t actually know much about DataONE and would love to have some of their folks come by, correct my mistaken impressions, and just chat about this stuff in general.

To begin with I should say that I’m still trying to figure this out myself, both because I’m still figuring out exactly what DataONE is going to be, and because EcologicalData is still evolving. I think that both projects’ goals could be largely defined as “Organizing Ecology’s Data,” but that’s a pretty difficult task, involving a lot of components and a lot of different ways to tackle them. So, my general perspective is that the more folks we have trying the merrier. I suspect there will be plenty of room for multiple related projects, but I’d be just as happy (even happier probably) if we could eventually find a single centralized location for handling all of this. All I want is a solution to the challenge.

But, to get to the question at hand, here are the differences I see based on my current understanding of DataONE:

1. Approach. There are currently two major paradigms for organizing large amounts of information. The first is to figure out a way to tell computers how to do it for us (e.g., Google); the second is to crowdsource its development and curation (e.g., Wikipedia). DataONE is taking the computer-based approach. It’s heavy on metadata, ontologies, etc. The goal is to manage the complexities of ecological data by providing the computer with very detailed descriptions of the data that it can understand. We’re taking the human approach, keeping things simple and trying to leverage the collective knowledge and effort of the field. As part of this difference in approach I suspect that EcologicalData will be much more interactive and community driven (the goal is for the community to actually run the site, just like Wikipedia) whereas DataONE will tend to be more centralized and hierarchical. I honestly couldn’t tell you which will turn out better (perhaps the two approaches will each turn out to be better for different things) but I’m really glad that we’re trying both at the same time to figure out what will work and where their relative strengths might be.

2. Actually serving data. DataONE will do this; we won’t. This is part of the difference in approach. If the computer can handle all of the thinking with respect to the data then you want it to do that and just spit out what you want. Centralizing the distribution of heterogeneous data is a really complicated task and I’m excited the folks at DataONE are tackling the challenge.

a. One of the other challenges for serving data is that you have to get all of the folks who “own” the data to let you provide it. This is one of the reasons I came up with the Data Wiki idea. By serving as a portal it helps circumvent the challenges of getting all of the individual stakeholders to agree to participate.

b. We do provide a tool for data acquisition, the EcoData Retriever, that likewise focuses on circumventing the need to negotiate with data providers by allowing each individual investigator to automatically download the data from the source. But, this just sets up each dataset independently, whereas I’m presuming that DataONE will let you just run one big query of all the data (which I’m totally looking forward to by the way) [1].

3. Focus. The primary motivation behind the Data Wiki goes beyond identifying datasets and really focuses on how you should use them. Having worked with other folks’ data for a number of years, I can say that the biggest challenge (for me anyway) is actually figuring out all of the details of when and how a dataset should be used. This isn’t just a question of reading metadata either. It’s a question of integrating thoughts and approaches from across the literature. What I would like to see develop on the Data Wiki pages is concise descriptions of how to go about using these datasets in the best way possible.  This is a very difficult task to automate and one where I think a crowdsourced solution is likely the most effective. We haven’t done a great job of this yet, but Allen Hurlbert and I have some plans to develop a couple of good examples early in the fall to help demonstrate the idea.

4. We’re open for business. Ha ha, eat our dust DataONE. But seriously, we’ve taken a super simple approach which means we can get up and running quickly. DataONE is doing something much more complicated and so things may take some time to roll out. I’m hoping to get a better idea of what their time lines look like at ESA. I’m sure their tools will be well worth the wait.

5. Oh, and their budget is a little over $2,000,000/year, which is just slightly larger than our budget of around $5,000/year.

So, there is my lengthy and meandering response to Jarrett’s question. I’m looking forward to chatting with DataONE folks at ESA to find out more about what they are up to, and I’d love to have them stop by here to chat and clear up my presumably numerous misconceptions.

——————————————————————————————————————-

[1] Though we do have some ideas for managing something somewhat similar, so stay tuned for EcoData Retriever 2.0. Hopefully coming to an internet near you sometime this spring.

Michael Nielsen on the importance and value of Open Science

We are pretty excited about what modern technology can do for science, and in particular the potential for increasingly rapid sharing of, and collaboration on, data and ideas. It’s the big picture that explains why we like to blog, tweet, and publish data and code, and we’ve benefited greatly from others who do the same. So, when we saw this great talk by Michael Nielsen about Open Science, we just had to share.

(via, appropriately enough, @gvwilson and @TEDxWaterloo on Twitter)

A GitHub of Science? [Things you should read]

There is an excellent post on open science, prestige economies, and the social web over at Marciovm’s posterous*. For those of you who aren’t insanely nerdy**, GitHub is… well… let’s just call it a very impressive collaborative tool for developing and sharing software***. But don’t worry, you don’t need to spend your days tied to a computer or have any interest in writing your own software to enjoy gems like:

Evangelists for Open Science should focus on promoting new, post-publication prestige metrics that will properly incentivize scientists to focus on the utility of their work, which will allow them to start worrying less about publishing in the right journals.

Thanks to Carl Boettiger for pointing me to the post. It’s definitely worth reading in its entirety.

_______________________________________________________

*A blog I’d never heard of before, but I subscribed to its RSS feed before I’d even finished the entire post.

**As far as biologists go. And, yes, when I say “insanely nerdy” I do mean it as a compliment.

***For those interested in slightly more detail it’s a social application wrapped around the popular distributed version control system named Git. Kind of like Sourceforge on steroids.

Learning to program like a professional using Software Carpentry

An increasingly large number of folks doing research in ecology and other biological disciplines spend a substantial portion of their time writing computer programs to analyze data and simulate the outcomes of biological models. However, most ecologists have little formal training in software development¹. A recent survey suggests that we are not alone: 96% of scientists report that they are mostly self-taught when it comes to writing code. This makes sense because there are only so many hours in the day, and scientists are typically more interested in answering important questions in their field than in sitting through a bachelor’s degree worth of computer science classes. But it also means that we spend longer than necessary writing our software, it contains more bugs, and it is less useful to other scientists than it could be².

Software Carpentry to the Rescue

Fortunately you don’t need to go back to college and get another degree to substantially improve your knowledge and abilities when it comes to scientific programming, because with a few weeks of hard work Software Carpentry will whip you into shape. Software Carpentry was started back in 1997 to teach scientists “the concepts, skills, and tools they need to use and build software more productively” and it does a great job. The newest version of the course is composed of a combination of video lectures and exercises, and provides quick and to the point information on such critical things as the Unix shell, version control, testing, and databases, along with lots of treatment of best practices for writing code that is clear and easy to read, both for other people and for yourself a year from now when you sit down and try to figure out exactly what you did³.

The great thing about Software Carpentry is that it skips over all of the theory and detail that you’d get when taking the relevant courses in computer science and gets straight to the crux: how to use the available tools most effectively to conduct scientific research. This means that in about 40 hours of lecture and 100-200 hours of practice you can become a much, much better programmer who writes code more quickly, with fewer bugs, that can be easily reused. I think of it as boot camp for scientific software development. You won’t be an expert marksman or a black belt in Jiu-Jitsu when you’re finished, but you will know how to fire a gun and throw a punch.

I can say without hesitation that taking this course is one of the most important things I’ve done in terms of tool development in my entire scientific career. If you are going to write more than 100 lines of code per year for your research then you need to either take this course or find someone to offer something equivalent at your university. Watch the lectures, do the exercises, and it will save you time and energy on programming, giving you more of both to dedicate to asking and answering important scientific questions.

______________________________________________________

¹I took 3 computer science courses in college and I get the impression that that is about 2-3 more courses than most ecologists have taken.

²I don’t know of any data on this, but my impression is that over 90% of code written by ecologists is written by a single individual and never read or used by anyone else. This is in part because we have no culture of writing code in such a way that other people can understand what we’ve done and therefore modify it for their own use.

³I know that I’ve decided that it was easier to “just start from scratch” rather than reusing my own code on more than one occasion. That won’t be happening to me again, thanks to Software Carpentry.