We are very excited to announce the newest release of our EcoData Retriever software and the first release of a supporting R package, ecoretriever. If you’re not familiar with the EcoData Retriever you can read more here.
The biggest improvement to the Retriever in this set of releases is the ability to run it directly from R. Dan McGlinn did a great job leading the development of this package and we got a ton of fantastic help from the folks at rOpenSci (most notably Scott Chamberlain, Gavin Simpson, and Karthik Ram). Now, once you install the main EcoData Retriever, you can run it from inside R by doing things like:
install.packages('ecoretriever')
library(ecoretriever)

# List the datasets available via the Retriever
ecoretriever::datasets()

# Install the Gentry dataset into csv files in your working directory
ecoretriever::install('Gentry', 'csv')

# Download the raw Gentry dataset files, without any processing,
# to the subdirectory named data
ecoretriever::download('Gentry', './data/')

# Install and load a dataset as a list
Gentry = ecoretriever::fetch('Gentry')
names(Gentry)
head(Gentry$counts)
The other big advance in this release is the ability to have the Retriever directly download files instead of processing them. This allows us to support data that doesn’t come in standard tabular forms. So, we can now include things like environmental data in GIS formats and phylogenetic data such as supertrees. We’ve used this new capability to allow the automatic downloading of the Bioclim data, one of the most widely used climate datasets in ecology, and the supertree for mammals from Fritz et al. 2009.
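Since downloading is exposed through the R package as well, grabbing these new datasets from inside R should be a one-liner via the download function shown above (a sketch; the 'Bioclim' short name is an assumption, so check ecoretriever::datasets() for the exact name):

# Download the raw Bioclim GIS files, without processing, to ./data
# (the 'Bioclim' short name is an assumption; see ecoretriever::datasets())
ecoretriever::download('Bioclim', './data/')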
EcoData Retriever: quickly download and clean up ecological data so you can get back to doing science
If you’ve ever worked with scientific data, your own or someone else’s, you know that you can end up spending a lot of time just cleaning up the data and getting it into a state that makes it ready for analysis. This involves everything from cleaning up non-standard null values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with them. Doing this for one dataset can be a lot of work, and if you work with a number of different databases like I do, the time and energy can really take away from the time you have to actually do science.
Over the last few years Ben Morris and I have been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With a click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary), and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.
Just click on the box to get the data:
Or run a command like this from the command line:
retriever install msaccess BBS --file myaccessdb.accdb
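The same pattern should work for the other supported database management systems; for example, something along these lines for SQLite (a sketch assuming the --file option behaves the same way for the sqlite engine):

retriever install sqlite BBS --file mybbsdatabase.sqlite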
This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up to date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.
The Retriever handles things like the following (a sketch of the manual equivalent appears after the list):
- Creating the underlying database structures
- Automatically determining delimiters and data types
- Downloading the data (and if there are over 100 data files that can be a lot of clicks)
- Transforming data into standard structures so that common tools in R, Python, and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
- Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones
- Combining multiple data files into single tables
- Placing all related tables in a single database or schema
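To give you a feel for what's being automated, here's a rough R sketch of the kind of cleanup code the Retriever saves you from writing yourself (the file names and null codes here are invented for illustration):

# Hypothetical raw data files split across multiple downloads
files <- c('counts_part1.csv', 'counts_part2.csv')
tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
# Combine the separate files into a single table
combined <- do.call(rbind, tables)
# Convert non-standard null values (e.g., 999.0, -999, NoData) to R's NA
combined[combined == -999 | combined == 999.0] <- NA
combined[combined == 'NoData'] <- NA
write.csv(combined, 'counts_clean.csv', row.names = FALSE)

Now multiply that by dozens of files, formats, and datasets and it's easy to see where the time goes.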
The EcoData Retriever currently includes a number of large, openly available, ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables a Retriever script can be as simple as:
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
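Filled in for a hypothetical data source, that might look like this (the dataset name, short name, and URL below are invented for illustration):

name: Example Bird Counts
description: Point counts of birds from a hypothetical long-term monitoring network.
shortname: ExampleBirds
table: counts, http://example.com/birds/counts.csv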
We also have some exciting new features on the To Do list including:
- Automatically cleaning up the taxonomy using existing services
- Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
- Integration into R and Python
Let us know what you think we should work on next in the comments.
As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:
- Data
  - Knowledge of SQL
  - Dealing with proper database format and structure
  - Finding data
  - Appropriate treatments of data
  - Understanding what good data are
- Statistics
  - Monte Carlo methods
  - Maximum likelihood methods
  - Power analysis
- Math
  - Higher level calculus
  - Should be able to derive analytical solutions for problems
- Programming
  - Should be able to write programs for analysis, not just simple statistics and simple graphs
  - Able to use version control
  - Once you can program in one language, you should be able to program in other languages without much effort, but should be fluent in at least one language
Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas. However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills. Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.
Achieving proficiency in an area is best done by using it for a project that you are interested in. The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.
The attribute should be generalizable to other problems: For example, if you need to learn maximum likelihood for your project, you should understand how to apply it to other questions. If you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
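To make that concrete, here's a minimal R sketch (assuming the DBI and RSQLite packages, with invented database and table names) of why SQL skills transfer: the query stays the same and only the connection changes:

library(DBI)
# Connect to one particular database (a hypothetical SQLite file); querying a
# different database means changing this connection line, not the SQL
con <- dbConnect(RSQLite::SQLite(), 'surveys.sqlite')
counts_by_species <- dbGetQuery(con, 'SELECT species, COUNT(*) AS n FROM counts GROUP BY species')
dbDisconnect(con)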
In graduate school:
Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data. Even if they cannot write SQL code, they need to understand what good and bad databases look like, develop a good sense for questionable data, and understand how known issues with data could affect the appropriateness of the data for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.
A student who is getting a terminal master’s and is planning on using pre-existing data should probably focus on the data skill (because data is a highly marketable skill, and understanding data prevents major mistakes). If the data are not coming from a central database, like the BBS, where the quality of the data is known, additional time will be needed to compile the data, clean it, figure out whether it can be used responsibly, and fill holes in it.
Master’s students who want to go on for a Ph.D. should decide what questions they are interested in and should try to pick a project that focuses on learning a skill that will give them a head start: more empirical (programming or stats), more theoretical (math), or more applied (math, e.g., for developing models; stats, e.g., applying and evaluating pre-existing models; or programming, e.g., making tools for people to use).
Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions. Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill. Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.
Right now, as a Ph.D. student, I am specializing in data and programming. I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project. For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.
Slides and script from Ethan White’s Ignite talk on Big Data in Ecology from Sandra Chung and Jacquelyn Gill’s excellent ESA 2013 session on Sharing Makes Science Better. Slides are also archived on figshare.
1. I’m here to talk to you about the use of big data in ecology and to help motivate a lot of the great tools and approaches that other folks will talk about later in the session.
2. The definition of big is of course relative, and so when we talk about big data in ecology we typically mean big relative to our standard approaches based on observations and experiments conducted by single investigators or small teams.
3. And for those of you who prefer a more precise definition, my friend Michael Weiser defines big data and ecoinformatics as involving anything that can’t be successfully opened in Microsoft Excel.
4. Data can be of unusually large size in two ways. It can be inherently large, like citizen science efforts such as Breeding Bird Survey, where large amounts of data are collected in a consistent manner.
5. Or it can be large because it’s composed of a large number of small datasets that are compiled from sources like Dryad, figshare, and Ecological Archives to form useful compilation datasets for analysis.
6. We have increasing amounts of both kinds of data in ecology as a result of both major data collection efforts and an increased emphasis on sharing data.
7-8. But what does this kind of data buy us? First, big data allows us to work at scales beyond those at which traditional approaches are typically feasible. This is critical because many of the most pressing issues in ecology, including climate change, biodiversity, and invasive species, operate at broad spatial and long temporal scales.
9-10. Second, big data allows us to answer questions in general ways, so that we get the answer today instead of waiting a decade to gradually compile enough results to reach consensus. We can do this by testing theories using large amounts of data from across ecosystems and taxonomic groups, so that we know that our results are general and not specific to a single system (e.g., White et al. 2012).
11. This is the promise of big data in ecology, but realizing this potential is difficult because working with either truly big data or data compilations is inherently challenging, and we still lack sufficient data to answer many important questions.
12. This means that if we are going to take full advantage of big data in ecology we need 3 things: training in computational methods for ecologists, tools that make it easier to work with existing data, and more data.
13. We need to train ecologists in the computational tools needed for working with big data, and there are an increasing number of efforts to do this including Software Carpentry (which I’m actively involved in) as well as training initiatives at many of the data and synthesis centers.
14. We need systems for storing, distributing, and searching data like DataONE, Dryad, and NEON’s data portal, as well as the standardized metadata and associated tools that make finding data to answer a particular research question easier.
15. We need crowd-sourced systems like the Ecological Data Wiki to allow us to work together on improving insufficient metadata and understanding what kinds of analyses are appropriate for different datasets and how to conduct them rigorously.
16. We need tools for quickly and easily accessing data like rOpenSci and the EcoData Retriever so that we can spend our time thinking and analyzing data rather than figuring out how to access it and restructure it.
17. We also need systems that help turn small data into big data compilations, whether it be through centralized standardized databases like GBIF or tools that pull data together from disparate sources like Map of Life.
18. And finally, we need to continue to share more and more data, and to share it in useful ways: with good formats, standardized metadata, and open licenses that make it easy to work with.
19. And so, what I would like to leave you with is that we live in an exciting time in ecology thanks to the generation of large amounts of data by citizen science projects, exciting federal efforts like NEON, and a shift in scientific culture towards sharing data openly.
20. If we can train ecologists to work with and combine existing tools in interesting ways, it will let us combine datasets spanning the surface of the globe and diversity of life to make meaningful predictions about ecological systems.
We have all bemoaned the increasing difficulty of keeping up with the growing body of literature. Many of us, me included, have been relying increasingly on following only a subset of journals, but with the growing popularity of the large open-access journals, I for one am increasingly likely to miss papers. The purpose of this post isn’t to give you a panacea for this problem (sadly I don’t think one exists, though I have hopes that someone will come up with something viable in the future). The purpose of this post is to let you know about an interesting addition, or alternative (for the brave), to the frantic scanning of tables of contents or RSS feeds: Google Scholar.
Almost everyone at this point knows you can go to Google Scholar, search for key words, and get a list of papers. But did you also know that you can set up a Google Scholar profile with your published articles, and that Google can use it to find articles that might be of interest to you? How does it do that? I’ll have to quote Google’s blog because it’s a little like voodoo to me (obviously this is Morgan writing this post, not Ethan): “We determine relevance using a statistical model that incorporates what your work is about, the citation graph between articles, the fact that interests can change over time, and the authors you work with and cite.” When you go to Google Scholar’s homepage (and you’re logged in as you) it’ll notify you if there are new articles on your suggested list. I have actually been pleasantly surprised by the articles it has identified for me, including some book chapters I would never have seen. For example, here are several things that sound really interesting to me but that I would never have come across on my own:
MC Emmerson – Marine Biodiversity and Ecosystem Functioning: …, 2012 – books.google.com
A Potochnik, B McGill – Philosophy of Science, 2012 – JSTOR
D West, J BRUCE – International Journal of Modern Physics B, 2012 – World Scientific
It doesn’t just search published journal articles. For example there are preprints from arXiv and government reports on my list. I don’t know if this would work as well for the young graduate students/postdocs since it uses the citations in your existing papers and our junior colleagues might have less data for Google to work with. However, once you have a profile, you can also follow other people who have profiles, which means you get an email every time scholarly work gets added to their profile. Are you a huge Simon Levin groupie? You can follow him and every time a paper gets added to his profile, you can get an email alerting you about the new paper. I also use this to follow a bunch of interesting younger people because they often publish less frequently or in journals I don’t happen to follow and this way I don’t miss their stuff when my Google Reader hits 1000+ articles to be perused! You can also sign up for alerts when someone you follow has their work cited. (And you can set up alerts for when your own work gets cited as well).
As I said before, I don’t think Google Scholar is a one-stop shop for literature monitoring (yet), but I find it useful for getting me out of my high-impact-factor monitoring rut. The only thing you need to do is set up your Google Scholar profile, and the only reason not to do that is if you’re worried it’ll give Google the edge when it finally becomes self-aware and renames itself Skynet (ha ha ha ha….hmmm).
There is an exciting postdoc opportunity for folks interested in quantitative approaches to studying evolution in Michael Gilchrist’s lab at the University of Tennessee. I knew Mike when we were both in New Mexico. He’s really sharp, a nice guy, and a very patient teacher. He taught me all about likelihood and numerical maximization and opened my mind to a whole new way of modeling biological systems. This will definitely be a great postdoc for the right person, especially since NIMBioS is at UTK as well. Here’s the ad:
Outstanding, motivated candidates are being sought for a post-doctoral position in the Gilchrist lab in the Department of Ecology & Evolutionary Biology at the University of Tennessee, Knoxville. The successful candidate will be supported by a three year NSF grant whose goal is to develop, integrate, and test mathematical models of protein translation and sequence evolution using available genomic sequence and expression level datasets. Publications directly related to this work include Gilchrist, M.A. 2007, Molec. Bio. & Evol. (http://www.tinyurl/shahgilchrist11) and Shah, P. and M.A. Gilchrist 2011, PNAS (http://www.tinyurl/gilchrist07a).
The emphasis of the laboratory is on using biologically motivated models to analyze complex, heterogeneous datasets to answer biologically motivated questions. The research associated with this position draws upon a wide range of scientific disciplines including: cellular biology, evolutionary theory, statistical physics, protein folding, differential equations, and probability. Consequently, the ideal candidate would have a Ph.D. in either biology, mathematics, physics, computer science, engineering, or statistics with a background and interest in at least one of the other areas.
The researcher will collaborate closely with the PIs (Drs. Michael Gilchrist and Russell Zaretzki) on this project but will potentially have time to collaborate on other research projects with the PIs. In addition, the researcher will have opportunities to interact with other faculty members in the Division of Biology as well as researchers at the National Institute for Mathematical and Biological Synthesis (http://www.nimbios.org).
Review of applications begins immediately and will continue until the position is filled. To apply, please submit curriculum vitae including three references, a brief statement of research background and interests, and 1-3 relevant manuscripts to mikeg[at]utk[dot]edu.
Some time ago in academia we realized that it didn’t make sense for individual scientists or even entire departments to maintain their own high performance computing resources. Use of these resources by an individual is intensive but sporadic, and maintenance of the resources is expensive, so universities soon realized they were better off having centralized high performance computing centers: computing resources were available when needed, and the averaging effects of having large numbers of individuals using the same computers meant that the machines didn’t spend much time sitting idle. This was obviously a smart decision.
So, why haven’t universities been smart enough to centralize an even more valuable computational resource, their computer labs?
As any student of Software Carpentry will tell you, it is far more important to be able to program well than it is to have access to a really large high performance computing center. This means that the most important computational resource a university has is the classes that teach their students how to program, and the computer labs on which they rely.
At my university, all of the computer labs on campus are controlled by either individual departments or individual colleges. This means that if you want to teach a class in one of them you can’t request it as a room through the normal scheduling process; you have to ask the cognizant university fiefdom for permission. This wouldn’t be a huge issue, except that in my experience the answer is typically a resounding no. And it’s not a “no, we’re really sorry, but the classroom is booked solid with our own classes,” it’s “no, that computer lab is ours, good luck.”
And this means that we end up wasting a lot of expensive university resources. For example, last year I taught in a computer lab “owned” by another college. I taught in the second class slot of a four slot afternoon. In the slot before my class there was a class that used the room about four times during the semester (out of 48 class periods). There were no classes in the other two afternoon slots. That means that classes were being taught in the lab only 27% of the time, or 2% of the time if I hadn’t been granted an exception to use the lab.
Since computing skills are increasingly critical to many areas of science (and everything else for that matter), this territoriality with respect to computer labs means that they proliferate across campus. The departments/colleges of Computer Science, Engineering, Social Sciences, Natural Resources, and Biology all end up creating and maintaining their own computer labs, and those labs end up sitting empty (or being used by students to send email) most of the time. This is horrifyingly inefficient in an era where funds for higher education are increasingly hard to come by and where technology turns over at an ever increasing rate.

Which brings me to the title of this post. The solution to this problem is for universities to stop allowing computer labs to be controlled by individual colleges/departments, in exactly the same way that most classrooms are not controlled by colleges/departments. Most universities have a central unit that schedules classrooms, and classes are fit into the available spaces. There is of course a highly justified bias toward putting classes in the buildings of the cognizant department, but large classes in particular may very well not be in the department’s building. It works this way because if it didn’t, the university would be wasting huge amounts of space having one or more lecture halls in every department, even if they were only needed a few hours a week. The same issue applies to computer labs, only they are also packed full of expensive electronics. So please universities, for the love of all that is good and right and simply fiscally sound in the world, start treating computer labs like what they are: really valuable and expensive classrooms.
Think of a single scientist who keeps 10 expensive computers but only uses them a total of 1-2 months per year, and when he does, the 10 computers aren’t really enough, so he has to wait a long time to finish the analysis.
And I think the point I’m about to make is generally true; at least it has been at the several other universities I’ve worked at over the years.
Or in some cases something more like “Frak you. You fraking biologists have no fraking right to teach anyone a fraking thing about fraking computers.” Needless to say, the individual in question wasn’t actually saying frak, but this is a family blog.
As a result of a personal favor done for one administrator by another administrator.
I know because I took advantage of this to hold my office hours in the computer lab following class.
To be fair, it should be noted that this and other computer labs are often used by students for doing homework (along with other less educationally oriented activities) when classes are not using the rooms. But in this case the classroom was a small part of a much larger lab, and since I never witnessed the non-classroom portion of the lab being filled to capacity, the argument stands.
etc., etc., etc.