Data Retriever 2.0: We handle the data so you can focus on the analysis
We are very excited to announce a major new release of the Data Retriever, our software for making it quick and easy to get clean, ready-to-analyze versions of publicly available data.
The Data Retriever automates the downloading, cleaning, and installing of ecological and environmental data into your choice of databases and flat file formats. Instead of spending hours tracking down the data on the web, downloading it, trying to import it, running into issues (e.g., non-standard nulls, problematic column names, encoding issues), fixing one problem, and then encountering the next, all you need to do is run a single command from the command line:
$ retriever install csv iris
$ retriever install sqlite breed-bird-survey -f bbs.sqlite
or from R:
>>> rdataretriever::install('postgres', 'wine-quality')
>>> portal_data <- rdataretriever::fetch('portal')
The Data Retriever uses information in Frictionless Data datapackage.json files to automatically handle all of the complexities of “simple” data for you. For more complicated datasets, with dozens of components or major data structure issues, the Retriever uses Python scripts as plugins to handle the major data cleaning work and then automatically handles the rest.
To find out more about the Data Retriever, check out the websites, the full documentation, and the GitHub repositories for both the Data Retriever and the R Data Retriever package.
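To give a feel for the kind of information a datapackage.json file carries, here is a minimal Python sketch. The descriptor is hypothetical (the dataset name, path, and field names are made up for illustration), and the `clean_row` helper is not part of the Retriever. It only shows how a table schema with `missingValues` and field types could, in principle, drive the cleaning of a single row.

```python
import json

# Hypothetical descriptor, written inline so the example is self-contained.
# Names, path, and fields are made up; only the overall layout follows the
# datapackage.json / Table Schema conventions.
DESCRIPTOR = json.loads("""
{
  "name": "example-dataset",
  "resources": [
    {
      "name": "counts",
      "path": "counts.csv",
      "schema": {
        "missingValues": ["", "-999", "NoData"],
        "fields": [
          {"name": "site", "type": "string"},
          {"name": "year", "type": "integer"},
          {"name": "mass", "type": "number"}
        ]
      }
    }
  ]
}
""")

# Map the schema's type names onto Python constructors.
CASTS = {"string": str, "integer": int, "number": float}


def clean_row(row, schema):
    """Cast one raw row (a dict of strings) using the schema.

    Values listed in the schema's missingValues become None.
    """
    missing = set(schema.get("missingValues", [""]))
    cleaned = {}
    for field in schema["fields"]:
        raw = row[field["name"]].strip()
        if raw in missing:
            cleaned[field["name"]] = None
        else:
            cleaned[field["name"]] = CASTS[field["type"]](raw)
    return cleaned


schema = DESCRIPTOR["resources"][0]["schema"]
print(clean_row({"site": "A1", "year": "2016", "mass": "-999"}, schema))
```

The point of the sketch is that the metadata, not hand-written code, says which values are nulls and what type each column should have.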
Expanded focus and name change
For those of you familiar with the EcoData Retriever, this is the same software with a new name. Challenges with the data end of the analysis pipeline occur across disciplines and our tools work just as well for non-ecological data, so we’ve started adding non-ecological data and changed our name to reflect that. We’d love to hear from anyone interested in leading a push to add data from another discipline or just interested in adding a single favorite dataset.
As part of this we’ve changed the name of the R package from
The 2.0 release includes a number of major changes including:
- Python 3 support (a single code base runs on both Python 2 and 3)
- Adoption of the Frictionless Data datapackage.json standard (replacing our old YAML-like metadata system), including a command line interface for creating and editing datapackage.json files
- Add json and xml as available output formats
- Major expansion of the documentation and hosting of the documentation at Read the Docs
- Remove the graphical user interface (to allow us to focus that development time on wrappers for other languages)
- Lots of work under the hood and major improvements in testing
- Broaden scope to include non-ecological data
We are also in the process of releasing version 1.0 of the R package. This version adds the new features in the Data Retriever and also includes major stability improvements, in particular in RStudio and on Windows.
We also have a brand new website.
Upgrading to the new version (UPDATED)
To ensure the smoothest upgrade to the new version we recommend:
- Run retriever reset scripts from the command line
- Uninstall the old version of the EcoData Retriever
- Install the new version
- Run retriever update from the command line
Henry Senyondo is the lead developer for the Data Retriever and has done an amazing job over the past year developing new features and shoring up the fundamentals for the software. He led the work on 2.0 from start to finish.
Akash Goel was a Google Summer of Code student with the project last summer and was responsible for the majority of the work adding Python 3 support and switching the project over to the
Dan McGlinn, the creator of the R package, has continued his excellent leadership of the development of this package. Shawn Taylor, a new contributor, was instrumental in solving the stability issues on Windows/RStudio.
In addition to these core folks our growing group of contributors to both projects have been invaluable for adding new functionality, fixing bugs, and testing new changes. We are super excited to have contributions from 30 different people and will keep working hard to make sure that everyone feels welcome and supported in contributing to the project.
The level of work done to get these releases out the door was only possible due to the generous support of the Gordon and Betty Moore Foundation’s Data Driven Discovery Initiative. This support allowed my group to employ Henry as a full-time software engineer to work on these and other projects. This kind of active support for the development and maintenance of research-oriented software makes sustainable software development at universities possible.
Fork our course: A semester-long Data Carpentry course for biologists
This post is co-authored by Zack Brym and Ethan White
Over the last year and a half we have been actively developing a semester-long Data Carpentry course designed to be easily customized and integrated into existing graduate and undergraduate curricula.
Data Carpentry for Biologists contains course materials for teaching scientists how to work more effectively with data. The course provides introductions to data management and relational databases, data manipulation and analysis, and data visualization. It covers the same general types of material as a two-day Data Carpentry workshop, but expands the materials and opportunities for practice into a full-length university course. The teaching material uses R and SQLite, with some corresponding materials for Python as well. To help students understand the direct applications to their interests, the examples and exercises focus on biological questions and working with real data. The course emphasizes using best practices to produce reusable and reproducible data analysis.
Active-learning Teaching Materials
Learning computing requires active practice by working through programming problems. Just diving in to computing is challenging for most scientists, so the course instruction is designed to combine short live-coding introductions to concepts followed immediately by the students working on a related exercise. Additional exercises are assigned later for practice. This follows the “I do”, “We do”, “You do” approach to teaching, which leverages the benefits of active-learning and flipped classrooms without leaving students who are less comfortable with the material feeling lost. The bulk of class time is spent working on assigned exercises with the instructor moving around the room helping guide students through things they don’t understand and engaging with students who are thinking about advanced applications of what they’ve learned.
This approach is the result of lots of reading about effective teaching methods and Ethan’s experience teaching this and related courses over the last six years at Utah State University and the University of Florida. It seems to work well for both students that get the material easily and those that find it more challenging. We’ve also tried to make these materials as useful as possible for self-guided students.
Open course development
Software Carpentry and Data Carpentry have shown how powerful collaborative lesson development can be and we’re interested in bringing that to the university classroom. We have designed the course materials to be modular and easy to modify, and the course website easy to clone and set up. All of the teaching materials and associated website files are openly available at the Data Carpentry for Biologists repository on GitHub under CC-BY and MIT licenses. The course materials are all written in Markdown and everything runs on Jekyll through GitHub Pages. Making your own version of the course should take less than an hour. We’ve developed documentation for how to create your own version of the course and how to contribute to development. Exercises and assignments are modular and changing exercises and assignments simply involves reordering items in a list. Adding a new exercise involves creating a new Markdown file and then adding its title to the list of exercises for an assignment.
If you teach, or want to teach, a course like this, we’d love to get you involved. Here are some useful links for getting started.
– I want to teach the course.
– I have some feedback.
– I want to contribute to the project.
We want to be sure getting involved is as easy as possible. We’ve worked hard to provide documentation and help resources for students and instructors. Students can find all they need to know at our student start guide. Instructors have access to course content and site design documentation.
If you’re having trouble finding something or getting something to work, or simply have some feedback about the course, please open a new issue at GitHub or send us an email.
Development of this course was generously supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.
New release of the EcoData Retriever
We are very excited to announce the newest release of the EcoData Retriever, our software for automating the downloading, cleaning, and installing of ecological and environmental data. Instead of spending hours or days trying to get complicated datasets like the Breeding Bird Survey ready for analysis, the Retriever lets you simply click a button or run a single command from R or the command line, and your computer does the rest.
It’s been over a year since the last retriever release and there are lots of new features and improvements to be excited about.
- We’ve added 21 new datasets, including major ecological and environmental datasets like eBird, Vertnet, the Global Wood Density Database, and the PRISM climate data.
- To support all of these datasets we’ve added support for additional data types, including larger-than-memory archive files, and we’ve also improved the ability to control where downloaded files are stored and how they are clustered together.
- We’ve significantly improved documentation and now have a new automatically built documentation site at Read The Docs.
- We’ve also made a lot of under the hood improvements.
This is also the first release that has been overseen by Weecology’s new software engineer, Henry Senyondo. We’re excited to have Henry on the team, and now that he’s around development of both the EcoData Retriever and other lab software projects will be happening more quickly.
A big thanks to the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative for funding this development through Grant GBMF4563 and to the National Science Foundation for funding as part of a CAREER award to Ethan White.
UPDATE: Led by Dan McGlinn we also released a new version of the ecoretriever R interface for the Retriever last fall. This makes using the Retriever from R as simple as:
data <- ecoretriever::fetch("BBS")
EcoData Retriever now supports R and environmental data, and has more datasets
We are very excited to announce the newest release of our EcoData Retriever software and the first release of a supporting R package, ecoretriever. If you’re not familiar with the EcoData Retriever you can read more here.
The biggest improvement to the Retriever in this set of releases is the ability to run it directly from R. Dan McGlinn did a great job leading the development of this package and we got a ton of fantastic help from the folks at rOpenSci (most notably Scott Chamberlain, Gavin Simpson, and Karthik Ram). Now, once you install the main EcoData Retriever, you can run it from inside R by doing things like:
install.packages('ecoretriever')
library(ecoretriever)
# List the datasets available via the Retriever
ecoretriever::datasets()
# Install the Gentry dataset into csv files in your working directory
ecoretriever::install('Gentry', 'csv')
# Download the raw Gentry dataset files, without any processing,
# to the subdirectory named data
ecoretriever::download('Gentry', './data/')
# Install and load a dataset as a list
Gentry = ecoretriever::fetch('Gentry')
names(Gentry)
head(Gentry$counts)
The other big advance in this release is the ability to have the Retriever directly download files instead of processing them. This allows us to support data that doesn’t come in standard tabular forms. So, we can now include things like environmental data in GIS formats and phylogenetic data such as supertrees. We’ve used this new capability to allow the automatic downloading of the Bioclim data, one of the most widely used climate datasets in ecology, and the supertree for mammals from Fritz et al. 2009.
Finally, we’ve also added the very cool mammalian diet dataset from Dryad.
EcoData Retriever: quickly download and cleanup ecological data so you can get back to doing science
If you’ve ever worked with scientific data, your own or someone else’s, you know that you can end up spending a lot of time just cleaning up the data and getting it into a state that makes it ready for analysis. This involves everything from cleaning up non-standard null values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with them. Doing this for one dataset can be a lot of work, and if you work with a number of different databases like I do, the time and energy can really take away from the time you have to actually do science.
Over the last few years Ben Morris and I have been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With a click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary) and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.
Just click on the box to get the data:
Or run a command like this from the command line:
retriever install msaccess BBS --file myaccessdb.accdb
This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up to date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.
The Retriever handles things like:
- Creating the underlying database structures
- Automatically determining delimiters and data types
- Downloading the data (and if there are over 100 data files that can be a lot of clicks)
- Transforming data into standard structures so that common tools in R and Python and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
- Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones
- Combining multiple data files into single tables
- Placing all related tables in a single database or schema
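A few of the steps in the list above can be sketched in a handful of lines of Python. This is only an illustration under stated assumptions, not the Retriever’s actual implementation: the sample data, the `NULLS` set, and the `to_long` helper are all made up for the example. It sniffs the delimiter, converts non-standard nulls, and turns a cross-tabulated table (one column per species) into one row per observation.

```python
import csv
import io

# Made-up cross-tabulated data: one row per site, one column per species.
RAW = "site;sp_a;sp_b\nA1;3;-999\nA2;NoData;7\n"

# Non-standard null markers, following the examples in the list above.
NULLS = {"", "999.0", "-999", "NoData"}


def to_long(text, id_column="site"):
    """Return (site, species, count) tuples from cross-tabulated text.

    The delimiter is detected automatically and null markers become None.
    """
    # Let the standard library guess the delimiter from the sample.
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
    reader = csv.DictReader(io.StringIO(text), dialect=dialect)
    rows = []
    for record in reader:
        for column, value in record.items():
            if column == id_column:
                continue  # the id column labels the row, it is not a count
            count = None if value in NULLS else int(value)
            rows.append((record[id_column], column, count))
    return rows


for row in to_long(RAW):
    print(row)
```

The long format that comes out is the kind of standard structure that database tables and most R and Python tools expect.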
The EcoData Retriever currently includes a number of large, openly available, ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables a Retriever script can be as simple as:
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
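To show how little structure a simple script like this needs, here is a rough Python sketch that reads the key/value format into a dictionary. The `parse_script` function is hypothetical and written only for this example; it is not the Retriever’s own parser, and it assumes one `key: value` pair per line with `table` lines holding a table name and a URL separated by a comma.

```python
# The example script from above, one "key: value" pair per line.
SCRIPT = """\
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
"""


def parse_script(text):
    """Parse the simple script format into a dict (illustrative only)."""
    meta = {"tables": {}}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so URLs in the value stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "table":
            # A table line is "TableName, url".
            table_name, _, url = value.partition(",")
            meta["tables"][table_name.strip()] = url.strip()
        else:
            meta[key] = value
    return meta


meta = parse_script(SCRIPT)
print(meta["shortname"])
print(meta["tables"])
```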
The Retriever has an installer for Windows, an App for Mac, and a package for Ubuntu/Debian Linux. See the quick explanation of how to get started and then go take it for a spin.
If you’re interested in reading more about the Retriever you can check out the website or read our paper on the project.
We also have some exciting new features on the To Do list including:
- Automatically cleaning up the taxonomy using existing services
- Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
- Integration into R and Python
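As a sketch of what the provenance tracking item could look like, the Python below builds a small record with the three pieces of information mentioned above. Nothing here exists in the Retriever: the `provenance_record` function, the field names, the version string, and the date are all hypothetical and chosen only for illustration.

```python
import json
from datetime import date


def provenance_record(dataset, software_version, cleanup_steps, downloaded=None):
    """Build a provenance record for one installed dataset (illustrative only)."""
    return {
        "dataset": dataset,
        # Default to today's date when no download date is given.
        "downloaded": (downloaded or date.today()).isoformat(),
        "software_version": software_version,
        "cleanup_steps": list(cleanup_steps),
    }


# Made-up values: the version string and date are placeholders.
record = provenance_record(
    "BBS",
    "0.0.0-example",
    ["converted non-standard nulls", "combined data files into one table"],
    downloaded=date(2014, 1, 15),
)
print(json.dumps(record, indent=2))
```

Storing a record like this next to the installed data would make it possible to say later exactly which download and which cleanup steps an analysis was based on.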
Let us know what you think we should work on next in the comments.
EcoBloggers: The ecology blog aggregator
EcoBloggers is a relatively new blog aggregator started by the awesome International Network of Next-Generation Ecologists (INNGE). Blog aggregators pull together posts from a number of related blogs to provide a one stop shop for folks interested in that topic. The most famous example of a blog aggregator in science is probably Research Blogging. I’m a big fan of EcoBloggers for three related reasons.
- It provides easy access to the conversations going on in the ecology blogosphere for folks who don’t have a well organized system for keeping up with blogs. If your only approach to keeping up with blogs is to check them yourself via your browser when you have a few spare minutes (or when you’re procrastinating on writing that next paper or grant) it really helps if you don’t have to remember to check a dozen or more sites, especially since some of those sites won’t post particularly frequently. Just checking EcoBloggers can quickly let you see what everyone’s been talking about over the last few days or weeks. Of course, I would really recommend using a feed reader both for tracking blogs and journal tables of contents, but lots of folks aren’t going to do that and blog aggregators are the next best thing.
- EcoBloggers helps new blogs, blogs with smaller audiences, and those that don’t post frequently, reach the broader community of ecologists. This is important for building a strong ecological blogging community by keeping lots of bloggers engaged and participating in the conversation.
- It helps expose readers to the breadth of conversations happening across ecology. This helps us remember that not everyone thinks like us or is interested in exactly the same things.
The site is also nicely implemented so that it respects the original sources of the content:
- It’s opt-in
- Each post lists the name of the originating blog and the original author
- All links take you to the original source
- It aggregates using RSS feeds, so you can set your site so that only partial articles show up on EcoBloggers (of course this requires you to ignore my advice on providing full feeds)
Are there any downsides to having your blog on EcoBloggers? I don’t think so. The one issue that might be raised is that if someone reads your article on EcoBloggers, then they may not actually visit your site and your stats could end up being lower than they would have otherwise. If any of the ecology blogs were making a lot of money off of advertising I could see this being an issue, but they aren’t. We’re presumably all here to engage in scientific dialogue and to communicate our ideas as broadly as possible. This is only aided by participating in an aggregator because your writing will reach more people than it would otherwise.
So, check out EcoBloggers, use it to keep up with what’s going on in the ecology blogosphere, and sign up your blog today.
UPDATE: According to a short chat on Twitter, EcoBloggers will soon be automatically shortening the posts on their site even if your blog is providing full feeds. This means that if you didn’t buy my arguments above and were worried about losing page views, there’s nothing to worry about. If the first paragraph or so of your post is interesting enough to get people hooked they’ll have to come over to your blog to read the rest.
A new database for mammalian community ecology and macroecology
There are a number of great datasets available for doing macroecology and community ecology at broad spatial scales. These include data on birds (Breeding Bird Survey, Christmas Bird Count), plants (Forest Inventory & Analysis, Gentry’s transects), and insects (North American Butterfly Association Counts). However, if you wanted to do work that relied on knowing the presence or abundance of individuals at particular sites (i.e., you’re looking for something other than range maps) there has never been a decent dataset to work with for mammals.
Announcing the Mammal Community Database (MCDB)
Over the past couple of years we’ve been working to fill that gap as best we could. Since coordinated continental scale surveys of mammals don’t yet exist¹, we dug into the extensive mammalogy literature and compiled a database of 1000 globally distributed communities. Thanks to Kate Thibault’s leadership and the hard work of Sarah Supp and Mikaelle Giffen, we are happy to announce that this data is now freely available as a data paper on Ecological Archives.
In addition to containing species lists for 1000 locales, there is abundance data for 940 of the locations, some site level body size data (~50 sites) and a handful of reasonably long (> 10 yr) time-series as well. Most of the data is restricted to the particular mode of sampling that an individual mammalogist uses and as a result much of the data is for small mammals captured in Sherman traps.
Working with data compilations like this is always difficult because the differences in sampling intensity and approaches between studies can make it very difficult to compare data across sites. We’ve put together a detailed table of information on how sampling was conducted to help folks break the data into comparable subsets and/or attempt to control for the influence of sampling differences in their statistical models.
The joys of Open Science
We’ve been gradually working on making the science that we do at Weecology more and more open, and the MCDB is an example of that. We submitted the database to Ecological Archives before we had actually done much of anything with it ourselves², because the main point of collecting the data was to provide a broadly useful resource to the ecological community, not to answer a specific question. We were really excited to see that as soon as we announced it on Twitter
“We just published a new data set of 1000 mammal communities esajournals.org/doi/abs/10.189… Check it out and do something cool with it.” (@weecology, December 28, 2011)
folks started picking it up and doing cool things with it³. We hope that folks will find all sorts of uses for it going forward.
We know that there is tons more data out there on mammal communities. Some of it is unpublished, or not published in enough detail for us to include. Some of it has licenses that mean that we can’t add it to the MCDB without special permission (e.g., there is a lot of great LTER mammal data out there). Lots of it we just didn’t find while searching through the literature.
If folks know of more data we’d love to hear about it. If you can give us permission to add data that has more restrictive licensing then we’d love to do so⁴. If you’re interested in collaborating on growing the database let us know. If there’s enough interest we can invest some time in developing a public portal.
The footnotes⁵
¹We are anxiously awaiting NEON’s upcoming surveys, headed up by former Weecology postdoc Kate Thibault.
²We have a single paper that is currently in review that uses the data.
³Thanks to Scott Chamberlain and Markus Gesmann. You guys are awesome!
⁴To be clear, we haven’t been asking for permission yet, so no one has turned us down. We wanted to get the first round of data collection done first to show that this was a serious effort.
⁵Because anything that David Foster Wallace loved has to be a good thing.
Learning to program like a professional using Software Carpentry
An increasingly large number of folks doing research in ecology and other biological disciplines spend a substantial portion of their time writing computer programs to analyze data and simulate the outcomes of biological models. However, most ecologists have little formal training in software development¹. A recent survey suggests that we are not alone, with 96% of scientists reporting that they are mostly self-taught when it comes to writing code. This makes sense because there are only so many hours in the day, and scientists are typically more interested in answering important questions in their field than in sitting through a bachelor’s degree’s worth of computer science classes. But it also means that we spend longer than necessary writing our software, it contains more bugs, and it is less useful to other scientists than it could be².
Software Carpentry to the Rescue
Fortunately you don’t need to go back to college and get another degree to substantially improve your knowledge and abilities when it comes to scientific programming, because with a few weeks of hard work Software Carpentry will whip you into shape. Software Carpentry was started back in 1997 to teach scientists “the concepts, skills, and tools they need to use and build software more productively” and it does a great job. The newest version of the course is composed of a combination of video lectures and exercises, and provides quick and to the point information on such critical things as:
along with lots of treatment of best practices for writing code that is clear and easy to read both for other people and for yourself a year from now when you sit down and try to figure out exactly what you did³.
The great thing about Software Carpentry is that it skips over all of the theory and detail that you’d get when taking the relevant courses in computer science and gets straight to the crux – how to use the available tools most effectively to conduct scientific research. This means that in about 40 hours of lecture and 100-200 hours of practice you can be a much, much better programmer who writes code more quickly, with fewer bugs, that can be easily reused. I think of it as boot camp for scientific software development. You won’t be an expert marksman or a black belt in Jiu-Jitsu when you’re finished, but you will know how to fire a gun and throw a punch.
I can say without hesitation that taking this course is one of the most important things I’ve done in terms of tool development in my entire scientific career. If you are going to write more than 100 lines of code per year for your research then you need to either take this course or find someone to offer something equivalent at your university. Watch the lectures, do the exercises, and it will save you time and energy on programming; giving you more of both to dedicate to asking and answering important scientific questions.
¹I took 3 computer science courses in college and I get the impression that that is about 2-3 more courses than most ecologists have taken.
²I don’t know of any data on this, but my impression is that over 90% of code written by ecologists is written by a single individual and never read or used by anyone else. This is in part because we have no culture of writing code in such a way that other people can understand what we’ve done and therefore modify it for their own use.
³I know that I’ve decided that it was easier to “just start from scratch” rather than reusing my own code on more than one occasion. That won’t be happening to me again thanks to Software Carpentry.
RStudio [Things you should use]
If you use R (and it seems like everybody does these days) then you should check out RStudio – an easy to install, cross-platform IDE for R. Basically it’s a seamless integration of all of the aspects of R (including scripts, the console, figures, help, etc.) into a single easy to use package. For those of you who are familiar with Matlab, it’s a very similar interface. It’s not a full blown IDE yet (no debugger; no lint) but what this actually means is that it’s simple and easy to use. If you use R I can’t imagine that you won’t love this new (and open source!) tool.
UPDATE: Check out another nice article on RStudio over at i’m a chordata! urochordata!