EcoData Retriever: quickly download and clean up ecological data so you can get back to doing science


If you’ve ever worked with scientific data, your own or someone else’s, you know that you can end up spending a lot of time just cleaning up the data and getting it into a state that makes it ready for analysis. This involves everything from cleaning up non-standard null values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with it. Doing this for one dataset can be a lot of work, and if you work with a number of different databases like I do, the time and energy can really take away from the time you have to actually do science.

Over the last few years Ben Morris and I have been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With the click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary), and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.

Just click on the box to get the data:

[Screenshot: retriever_main]

Or run a command like this from the command line:

retriever install msaccess BBS --file myaccessdb.accdb

This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up-to-date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets, like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.

The Retriever handles things like:

  1. Creating the underlying database structures
  2. Automatically determining delimiters and data types
  3. Downloading the data (and if there are over 100 data files that can be a lot of clicks)
  4. Transforming data into standard structures so that common tools in R and Python and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
  5. Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones
  6. Combining multiple data files into single tables
  7. Placing all related tables in a single database or schema
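
To make a few of these steps concrete, here is a minimal Python sketch of items 2, 4, and 5: guessing a delimiter, converting non-standard nulls, and turning cross-tabulated data into one row per observation. It is an illustration only, not the Retriever's own code. The function names, the sample data, and the column names are all made up for this example, and the list of null markers is an assumption that would need to be set per dataset (999.0 can be a perfectly real value in some data).

```python
import csv
import io

# Assumed null markers, taken from the examples in the list above.
# A real dataset needs its own list.
NULL_MARKERS = {"999.0", "-999", "NoData"}

# Made-up cross-tabulated sample: one column per species.
SAMPLE = "site\tyear\tsp_a\tsp_b\n1\t2010\t4\t-999\n2\t2010\tNoData\t7\n"


def read_rows(text):
    """Guess the delimiter with csv.Sniffer, then parse into dict rows."""
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    return list(csv.DictReader(io.StringIO(text), dialect=dialect))


def clean_nulls(row):
    """Replace any value listed in NULL_MARKERS with None."""
    return {key: (None if value in NULL_MARKERS else value)
            for key, value in row.items()}


def to_long(rows, id_columns, name_column, value_column):
    """Turn one-column-per-category rows into one row per observation."""
    long_rows = []
    for row in rows:
        ids = {column: row[column] for column in id_columns}
        for column, value in row.items():
            if column not in id_columns:
                long_rows.append(
                    {**ids, name_column: column, value_column: value})
    return long_rows


rows = [clean_nulls(row) for row in read_rows(SAMPLE)]
long_rows = to_long(rows, ["site", "year"], "species", "count")
for row in long_rows:
    print(row)
```

Values are left as strings here; inferring data types (the other half of item 2) would be a separate step.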

The EcoData Retriever currently includes a number of large, openly available, ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables a Retriever script can be as simple as:

name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
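
To show how little information a script like this carries, here is a rough Python sketch that reads the four lines above into a dictionary. This is not the Retriever's parser; the function name `parse_simple_script` and the shape of the result are assumptions made for illustration, and it only handles the simple `key: value` form shown here.

```python
EXAMPLE_SCRIPT = """\
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
"""


def parse_simple_script(text):
    """Parse 'key: value' lines; each 'table' line maps a table name to a URL."""
    script = {"tables": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so the colon in a URL is left alone.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "table":
            table_name, _, url = value.partition(",")
            script["tables"][table_name.strip()] = url.strip()
        else:
            script[key] = value
    return script


parsed = parse_simple_script(EXAMPLE_SCRIPT)
print(parsed["shortname"])
print(parsed["tables"])
```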

The Retriever has an installer for Windows, an App for Mac, and a package for Ubuntu/Debian Linux. See the quick explanation of how to get started and then go take it for a spin.

If you’re interested in reading more about the Retriever you can check out the website or read our paper on the project.

We also have some exciting new features on the To Do list including:

  • Automatically cleaning up the taxonomy using existing services
  • Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
  • Integration into R and Python
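
As a rough idea of what the provenance item could look like, here is a small Python sketch of a record holding the three pieces of information mentioned above. The feature is still on the To Do list, so everything here is hypothetical: the class name, the field names, and the version string are placeholders, not the Retriever's design.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class ProvenanceRecord:
    """Hypothetical record of how a dataset was obtained and cleaned."""
    dataset: str
    downloaded_on: str      # ISO date of the download
    software_version: str   # version of the tool that did the work
    cleanup_steps: list = field(default_factory=list)

    def to_json(self):
        """Serialize the record so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2)


record = ProvenanceRecord(
    dataset="BBS",
    downloaded_on=date.today().isoformat(),
    software_version="0.0.0",  # placeholder, not a real release number
)
record.cleanup_steps.append("converted non-standard nulls (-999, NoData)")
record.cleanup_steps.append("combined data files into a single table")
print(record.to_json())
```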

Let us know what you think we should work on next in the comments.

About Ethan White

I'm a happily married dad and a scientist. I like computers, math, stats, and good scotch. I believe in the importance of open science and a free and open web.

Posted on February 13, 2014, in computers, data, ecology, open science, productivity, science, things you should use. 3 Comments.

  1. Both links in the “If you’re interested in reading more…” sentence don’t work.

  2. Thanks for catching that. They should be fixed now.
