An increasingly large number of folks doing research in ecology and other biological disciplines spend a substantial portion of their time writing computer programs to analyze data and simulate the outcomes of biological models. However, most ecologists have little formal training in software development¹. A recent survey suggests that we are not only; with 96% of scientists reporting that they are mostly self-taught when it comes to writing code. This makes sense because there are only so many hours in the day, and scientists are typically more interested in answering important questions in their field than in sitting through a bachelors degree worth of computer science classes. But, it also means that we spend longer than necessary writing our software, it contains more bugs, and it is less useful to other scientists than it could be².
Software Carpentry to the Rescue
Fortunately you don’t need to go back college and get another degree to substantially improve your knowledge and abilities when it comes to scientific programming, because with a few weeks of hard work Software Carpentry will whip you into shape. Software Carpentry was started back in 1997 to teach scientists “the concepts, skills, and tools they need to use and build software more productively” and it does a great job. The newest version of the course is composed of a combination of video lectures and exercises, and provides quick and to the point information on such critical things as:
along with lots of treatment of best practices for writing code that is clear and easy to read both for other people and for yourself a year from now when you sit down and try to figure out exactly what you did³.
The great thing about Software Carpentry is that it skips over all of the theory and detail that you’d get when taking the relevant courses in computer science and gets straight to crux – how to use the available tools most effectively to conduct scientific research. This means that in about 40 hours of lecture and 100-200 hours of practice you can be a much, much, better programmer who rights code more quickly, with fewer bugs, that be easily reused. I think of it as boot camp for scientific software development. You won’t be an expert marksman or a black belt in Jiu-Jitsu when you’re finished, but you will know how to fire a gun and throw a punch.
I can say without hesitation that taking this course is one of the most important things I’ve done in terms of tool development in my entire scientific career. If you are going to write more than 100 lines of code per year for your research then you need to either take this course or find someone to offer something equivalent at your university. Watch the lectures, do the exercises, and it will save you time and energy on programming; giving you more of both to dedicate to asking and answering important scientific questions.
¹I took 3 computer science courses in college and I get the impression that that is about 2-3 more courses than most ecologists have taken.
²I don’t know of any data on this, but my impression is that over 90% of code written by ecologists is written by a single individual and never read or used by anyone else. This is in part because we have no culture of writing code in such a way that other people can understand what we’ve done and therefore modify it for their own use.
³I know that I’ve decided that it was easier to “just start from scratch” rather than reusing my own code on more than one occasion. That won’t be happening to me again thanks to Software Carpentry
If you use R (and it seems like everybody does these days) then you should check out RStudio – an easy to install, cross-platform IDE for R. Basically it’s a seamless integration of all of the aspects of R (including scripts, the console, figures, help, etc.) into a single easy to use package. For those of you are familiar with Matlab, it’s a very similar interface. It’s not a full blown IDE yet (no debugger; no lint) but what this actually means is that it’s simple and easy to use. If you use R I can’t imagine that you won’t love this new (and open source!) tool.
The Ecological Database Toolkit
Large amounts of ecological and environmental data are becoming increasingly available due to initiatives sponsoring the collection of large-scale data and efforts to increase the publication of already collected datasets. As a result, ecology is entering an era where progress will be increasingly limited by the speed at which we can organize and analyze data. To help improve ecologists’ ability to quickly access and analyze data we have been developing software that designs database structures for ecological datasets and then downloads the data, processes it, and installs it into several major database management systems (at the moment we support Microsoft Access, MySQL, PostgreSQL, and SQLite). The database toolkit system can substantially reduce hurdles to scientists using new databases, and save time and reduce import errors for more experienced users.
The database toolkit can download and install small datasets in seconds and large datasets in minutes. Imagine being able to download and import the newest version of the Breeding Bird Survey of North America (a database with 4 major tables and over 5 million records in the main table) in less than five minutes. Instead of spending an afternoon setting up the newest version of the dataset and checking your import for errors you could spend that afternoon working on your research. This is possible right now and we are working on making this possible for as many major public/semi-public ecological databases as possible. The automation of this process reduces the time for a user to get most large datasets up and running by hours, and in some cases days. We hope that this will make it much more likely that scientists will use multiple datasets in their analyses; allowing them to gain more rapid insight into the generality of the pattern/process they are studying.
We need your help
We have done quite a bit of testing on this system including building in automated tests based on manual imports of most of the currently available databases, but there are always bugs and imperfections in code that cannot be identified until the software is used in real world situations. That’s why we’re looking for folks to come try out the Database Toolkit and let us know what works and what doesn’t, what they’d like to see added or taken away, and if/when the system fails to work properly. So if you’ve got a few minutes to have half a dozen ecological databases automatically installed on your computer for you stop by the Database Toolkit page at EcologicalData.org, give it a try, and let us know what you think.
We have a postdoc position available for someone interested in the general areas of macroecology, quantitative ecology, and ecoinformatics. Here’s the short ad with links to the full job description:
Ethan White’s lab at Utah State University is looking for a postdoc to collaborate on research studying approaches for unifying macroecological patterns (e.g., species abundance distributions and species-area relationships) and predicting variation in these patterns using ecological and environmental variables. The project aims to 1) evaluate the performance of models that link ecological patterns by using broad scale data on at least three major taxonomic groups (birds, plants, and mammals); and 2) combine models with ecological and environmental factors to explain continental scale variation in community structure. Models to be explored include maximum entropy models, neutral models, fractal based models, and statistical models. The postdoc will also be involved in an ecoinformatics initiative developing tools to facilitate the use of existing ecological data. There will be ample opportunity for independent and collaborative research in related areas of macroecology, community ecology, theoretical ecology, and ecoinformatics. The postdoc will benefit from interactions with researchers in Dr. White’s lab, the Weecology Interdisciplinary Research Group, and with Dr. John Harte’s lab at the University of California Berkeley. Applicants from a variety of backgrounds including ecology, mathematics, statistics, physics and computer science are encouraged to apply. The position is available for 1 year with the possibility for renewal depending on performance, and could begin as early as September 2010 and no later than May 2011. Applications will begin to be considered starting on September 1, 2010. Go to the USU job page to see the full advertisement and to apply.
If you’re interested in the position and are planning to be at ESA please leave a comment or drop me an email (email@example.com) and we can try to set up a time to talk while we’re in Pittsburgh. Questions about the position and expressions of interest are also welcome.
UPDATE: This position has been filled.
I’ve recently started reading two scientific programming blogs that I think are well worth paying attention to, so I’m blogrolling them and offering a brief introduction here.
Serendipity is Steve Easterbrook’s blog about the interface between software engineering and climate science. Steve has a realistic and balanced viewpoint regarding the reality of programming in scientific disciplines. The blog is well written, insightful, etc., but I think the thing that really won me over were his sharp witted responses to the periodically asinine comments he receives. For example:
I’d care a lot less about seeing all the source and data if I could just ignore climate scientists and shop elsewhere. But since I’m expected to hand over $$$ and change my lifestyle because of this research, your arguments ring hollow…
[You can shop elsewhere – there are thousands of climate scientists across the world. If you don’t like the CRU folks, go to any one of a large number of climate science labs elsewhere (start here: http://www.realclimate.org/index.php/data-sources/). An analogy: Imagine your doctor told you that you have to change your eating habits, or your heart is unlikely to last out the year. You would go and get a second opinion from another doctor. And maybe a third. But when every qualified doctor tells you the same thing, do you finally accept their advice, or do you go around claiming that all doctors are corrupt? – Steve]
Software Carpentry is the sister blog to an excellent online (and occasionally in person) course on basic software development for scientists. I strongly recommend the course to anyone who is interested in getting more serious about their programming and the blog is a nice complement pointing readers to other resources and discussions related to scientific programming.