If you’ve every worked with scientific data, your own or someone elses, you know that you can end up spending a lot of time just cleaning up the data and getting it in a state that makes it ready for analysis. This involves everything from cleaning up non-standard nulls values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with them. Doing this for one dataset can be a lot of work and if you work with a number of different databases like I do the time and energy can really take away from the time you have to actually do science.
Over the last few years Ben Morris and I been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With a click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary) and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.
Just click on the box to get the data:
Or run a command like this from the command line:
retriever install msaccess BBS --file myaccessdb.accdb
This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up to date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.
The Retriever handles things like:
- Creating the underlying database structures
- Automatically determining delimiters and data types
- Downloading the data (and if there are over 100 data files that can be a lot of clicks)
- Transforming data into standard structures so that common tools in R and Python and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
- Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones
- Combining multiple data files into single tables
- Placing all related tables in a single database or schema
The EcoData Retriever currently includes a number of large, openly available, ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables a Retriever script can be as simple as:
name: Name of the dataset description: A brief description of the dataset of ~25 words. shortname: A one word name for the dataset table: MyTableName, http://awesomedatasource.com/dataset
We also have some exciting new features on the To Do list including:
- Automatically cleaning up the taxonomy using existing services
- Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
- Integration into R and Python
Let us know what you think we should work on next in the comments.