Over the last year and a half we have been actively developing a semester-long Data Carpentry course designed to be easily customized and integrated into existing graduate and undergraduate curricula.
Data Carpentry for Biologists contains course materials for teaching scientists how to work more effectively with data. The course provides introductions to data management and relational databases, data manipulation and analysis, and data visualization. It covers the same general types of material as a two-day Data Carpentry workshop, but expands the materials and opportunities for practice into a full-length university course. The teaching material uses R and SQLite, with some corresponding materials for Python as well. To help students understand the direct applications to their interests, the examples and exercises focus on biological questions and working with real data. The course emphasizes using best practices to produce reusable and reproducible data analysis.
Active-learning Teaching Materials
Learning computing requires active practice by working through programming problems. Just diving into computing is challenging for most scientists, so the course instruction is designed to pair short live-coding introductions to concepts with a related exercise that students work on immediately afterward. Additional exercises are assigned later for practice. This follows the “I do”, “We do”, “You do” approach to teaching, which leverages the benefits of active learning and flipped classrooms without leaving students who are less comfortable with the material feeling lost. The bulk of class time is spent working on assigned exercises, with the instructor moving around the room helping guide students through things they don’t understand and engaging with students who are thinking about advanced applications of what they’ve learned.
This approach is the result of lots of reading about effective teaching methods and Ethan’s experience teaching this and related courses over the last six years at Utah State University and the University of Florida. It seems to work well both for students who pick up the material easily and for those who find it more challenging. We’ve also tried to make these materials as useful as possible for self-guided students.
Open course development
Software Carpentry and Data Carpentry have shown how powerful collaborative lesson development can be and we’re interested in bringing that to the university classroom. We have designed the course materials to be modular and easy to modify, and the course website easy to clone and set up. All of the teaching materials and associated website files are openly available at the Data Carpentry for Biologists repository on GitHub under CC-BY and MIT licenses. The course materials are all written in Markdown and everything runs on Jekyll through GitHub Pages. Making your own version of the course should take less than an hour. We’ve developed documentation for how to create your own version of the course and how to contribute to development. Exercises and assignments are modular and changing exercises and assignments simply involves reordering items in a list. Adding a new exercise involves creating a new Markdown file and then adding its title to the list of exercises for an assignment.
If you teach, or want to teach, a course like this, we’d love to get you involved. Here are some useful links for getting started.
We want to be sure getting involved is as easy as possible. We’ve worked hard to provide documentation and help resources for students and instructors. Students can find all they need to know at our student start guide. Instructors have access to course content and site design documentation.
Development of this course was generously supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.
We are very excited to announce the newest release of the EcoData Retriever, our software for automating the downloading, cleaning, and installing of ecological and environmental data. Instead of spending hours or days trying to get complicated datasets like the Breeding Bird Survey ready for analysis, the Retriever lets you simply click a button or run a single command from R or the command line, and your computer does the rest.
It’s been over a year since the last retriever release and there are lots of new features and improvements to be excited about.
- We’ve added 21 new datasets, including major ecological and environmental datasets like eBird, Vertnet, the Global Wood Density Database, and the PRISM climate data.
- To handle all of these datasets we’ve added support for additional data types, including larger-than-memory archive files, and we’ve also improved the ability to control where downloaded files are stored and how they are clustered together.
- We’ve significantly improved documentation and now have a new automatically built documentation site at Read The Docs.
- We’ve also made a lot of under-the-hood improvements.
This is also the first release that has been overseen by Weecology’s new software engineer, Henry Senyondo. We’re excited to have Henry on the team, and now that he’s around, development of both the EcoData Retriever and other lab software projects will be happening more quickly.
A big thanks to the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative for funding this development through Grant GBMF4563 and to the National Science Foundation for funding as part of a CAREER award to Ethan White.
data <- ecoretriever::fetch("BBS")
I’m looking for one or more graduate students to join my group next fall. In addition to the official ad (below) I’d like to add a few extra thoughts. As Morgan Ernest noted in her recent ad, we have a relatively unique setup at Weecology in that we interact actively with members of the Ernest Lab. We share space, have joint lab meetings, and generally maintain a very close intellectual relationship. We do this with the goal of breaking down the barriers between the quantitative side of ecology and the field/lab side of ecology. Our goal is to train scientists who span these barriers in a way that allows them to tackle interesting and important questions.
I also believe it’s important to train students for multiple potential career paths. Members of my lab have gone on to faculty positions, postdocs, and jobs in both science non-profits and the software industry.
Scientists in my group regularly both write papers (e.g., these recent papers from dissertation chapters: Locey & White 2013, Xiao et al. 2014) and develop or contribute to software (e.g., EcoData Retriever, ecoretriever, rpartitions & pypartitions) even if they’ve never coded before they joined my lab.
My group generally works on problems at the population, community, and ecosystem levels of ecology. You can find out more about what we’ve been up to by checking out our website. If you’re interested in learning more about where the lab is headed I recommend reading my recently funded Moore Investigator in Data-Driven Discovery proposal.
PH.D. STUDENT OPENINGS IN QUANTITATIVE, COMPUTATIONAL, AND MACRO- ECOLOGY
The White Lab at the University of Florida has openings for one or more PhD students in quantitative, computational, and/or macro- ecology to start fall 2015. The student(s) will be supported as graduate research assistants from a combination of NSF, Moore Foundation, and University of Florida sources depending on their research interests.
The White Lab uses computational, mathematical, and advanced statistical/machine learning methods to understand and make predictions/forecasts for ecological systems using large amounts of data. Background in quantitative and computational techniques is not necessary, only an interest in learning and applying them. Students are encouraged to develop their own research projects related to their interests.
The White Lab is currently at Utah State University, but is moving to the Department of Wildlife Ecology and Conservation at the University of Florida starting summer 2015.
Interested students should contact Dr. Ethan White (email@example.com) by Nov 15th, 2014 with their CV, GRE scores, and a brief statement of research interests.
UPDATE: Added a note that we work at population, community, and ecosystem levels.
We are very excited to announce the newest release of our EcoData Retriever software and the first release of a supporting R package, ecoretriever. If you’re not familiar with the EcoData Retriever you can read more here.
The biggest improvement to the Retriever in this set of releases is the ability to run it directly from R. Dan McGlinn did a great job leading the development of this package and we got a ton of fantastic help from the folks at rOpenSci (most notably Scott Chamberlain, Gavin Simpson, and Karthik Ram). Now, once you install the main EcoData Retriever, you can run it from inside R by doing things like:
    install.packages('ecoretriever')
    library(ecoretriever)

    # List the datasets available via the Retriever
    ecoretriever::datasets()

    # Install the Gentry dataset into csv files in your working directory
    ecoretriever::install('Gentry', 'csv')

    # Download the raw Gentry dataset files, without any processing,
    # to the subdirectory named data
    ecoretriever::download('Gentry', './data/')

    # Install and load a dataset as a list
    Gentry = ecoretriever::fetch('Gentry')
    names(Gentry)
    head(Gentry$counts)
The other big advance in this release is the ability to have the Retriever directly download files instead of processing them. This allows us to support data that doesn’t come in standard tabular forms. So, we can now include things like environmental data in GIS formats and phylogenetic data such as supertrees. We’ve used this new capability to allow the automatic downloading of the Bioclim data, one of the most widely used climate datasets in ecology, and the supertree for mammals from Fritz et al. 2009.
As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:
- Data
  - Knowledge of SQL
  - Dealing with proper database format and structure
  - Finding data
  - Appropriate treatments of data
  - Understanding what good data are
- Stats
  - Monte Carlo methods
  - Maximum likelihood methods
  - Power analysis
- Math
  - Higher level calculus
  - Should be able to derive analytical solutions for problems
- Programming
  - Should be able to write programs for analysis, not just simple statistics and simple graphs.
  - Able to use version control
  - Once you can program in one language, you should be able to program in other languages without much effort, but should be fluent in at least one language.
Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas. However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills. Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.
Achieving proficiency in an area is best done by using it for a project that you are interested in. The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.
The attribute should be generalizable to other problems: For example, if you need to learn maximum likelihood for your project, you should understand how to apply it to other questions. If you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
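As a minimal sketch of what "generalizable" can look like for maximum likelihood, the R snippet below writes the likelihood once and then reuses it on two different datasets. Everything in it is an illustrative assumption rather than something from a real project: the Poisson model, the simulated counts, and the helper names `poisson_nll` and `fit_poisson` are all hypothetical.

```r
# Negative log-likelihood of a Poisson model with mean `lambda`,
# evaluated for any vector of non-negative integer counts.
poisson_nll <- function(lambda, counts) {
  -sum(dpois(counts, lambda = lambda, log = TRUE))
}

# Fit the model by numerically minimizing the negative log-likelihood.
# Nothing here is specific to one dataset, so the same function can be
# reused for a different question with different counts.
fit_poisson <- function(counts) {
  result <- optimize(poisson_nll,
                     interval = c(1e-6, max(counts) + 1),
                     counts = counts)
  result$minimum
}

# Two simulated (made-up) datasets standing in for two different projects
set.seed(1)
bird_counts <- rpois(200, lambda = 3.5)
seed_counts <- rpois(200, lambda = 12)

# For a Poisson model the maximum likelihood estimate is the sample mean,
# so each fit should land very close to mean() of its dataset.
fit_poisson(bird_counts)
fit_poisson(seed_counts)
```

The point of the sketch is the structure, not the model: once the likelihood is written as a function of the data, applying it to a new question mostly means swapping in a new dataset or a new likelihood function.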
In graduate school:
Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data; even if they cannot write SQL code, they need to understand what good and bad databases look like and develop a good sense for questionable data, and how known issues with data could affect the appropriateness of data for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.
A student who is getting a terminal master’s and is planning on using pre-existing data should probably focus on the data skill (because working with data is a highly marketable skill, and understanding data prevents major mistakes). If the data are not coming from a central database, like the BBS, where the quality of the data is known, additional time will have to be budgeted to compile the data, clean them, fill holes in them, and figure out whether they can be used responsibly.
Master’s students who want to go on for a Ph.D. should decide what questions they are interested in and should try to pick a project that focuses on learning a skill that will give them a head start: more empirical (programming or stats), more theoretical (math), or more applied (math, e.g., for developing models; stats, e.g., applying pre-existing models and evaluating models; or programming, e.g., making tools for people to use).
Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions. Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill. Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.
Right now, as a Ph.D. student, I am specializing in data and programming. I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project. For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.
There is an exciting postdoc opportunity for folks interested in quantitative approaches to studying evolution in Michael Gilchrist’s lab at the University of Tennessee. I knew Mike when we were both in New Mexico. He’s really sharp, a nice guy, and a very patient teacher. He taught me all about likelihood and numerical maximization and opened my mind to a whole new way of modeling biological systems. This will definitely be a great postdoc for the right person, especially since NIMBioS is at UTK as well. Here’s the ad:
Outstanding, motivated candidates are being sought for a post-doctoral position in the Gilchrist lab in the Department of Ecology & Evolutionary Biology at the University of Tennessee, Knoxville. The successful candidate will be supported by a three-year NSF grant whose goal is to develop, integrate and test mathematical models of protein translation and sequence evolution using available genomic sequence and expression level datasets. Publications directly related to this work include Gilchrist, M.A. 2007, Molec. Bio. & Evol. (http://www.tinyurl/shahgilchrist11) and Shah, P. and M.A. Gilchrist 2011, PNAS (http://www.tinyurl/gilchrist07a).
The emphasis of the laboratory is focused on using biologically motivated models to analyze complex, heterogeneous datasets to answer biologically motivated questions. The research associated with this position draws upon a wide range of scientific disciplines including: cellular biology, evolutionary theory, statistical physics, protein folding, differential equations, and probability. Consequently, the ideal candidate would have a Ph.D. in either biology, mathematics, physics, computer science, engineering, or statistics with a background and interest in at least one of the other areas.
The researcher will collaborate closely with the PIs (Drs. Michael Gilchrist and Russell Zaretzki) on this project but will potentially have time to collaborate on other research projects with the PIs. In addition, the researcher will have opportunities to interact with other faculty members in the Division of Biology as well as researchers at the National Institute for Mathematical and Biological Synthesis (http://www.nimbios.org).
Review of applications begins immediately and will continue until the position is filled. To apply, please submit curriculum vitae including three references, a brief statement of research background and interests, and 1-3 relevant manuscripts to mikeg[at]utk[dot]edu.
Some time ago in academia we realized that it didn’t make sense for individual scientists or even entire departments to maintain their own high performance computing resources. Use of these resources by an individual is intensive, but sporadic, and maintenance of the resources is expensive [1], so universities soon realized they were better off having centralized high performance computing centers: computing resources would be available when needed, and the averaging effect of having large numbers of individuals using the same computers meant that the machines didn’t spend much time sitting idle. This was obviously a smart decision.
So, why haven’t universities been smart enough to centralize an even more valuable computational resource, their computer labs?
As any student of Software Carpentry will tell you, it is far more important to be able to program well than it is to have access to a really large high performance computing center. This means that the most important computational resource a university has is the classes that teach their students how to program, and the computer labs on which they rely.
At my university [2] all of the computer labs on campus are controlled by either individual departments or individual colleges. This means that if you want to teach a class in one of them you can’t request it as a room through the normal scheduling process; you have to ask the cognizant university fiefdom for permission. This wouldn’t be a huge issue, except that in my experience the answer is typically a resounding no. And it’s not a “no, we’re really sorry but the classroom is booked solid with our own classes,” it’s “no, that computer lab is ours, good luck” [3].
And this means that we end up wasting a lot of expensive university resources. For example, last year I taught in a computer lab “owned” by another college [4]. I taught in the second class slot of a four slot afternoon. In the slot before my class there was a class that used the room about four times during the semester (out of 48 class periods). There were no classes in the other two afternoon slots [5]. That means that classes were being taught in the lab only 27% of the time (52 of the 192 afternoon class periods across the four slots), or 2% of the time (4 of 192) if I hadn’t been granted an exception to use the lab [6].
Since computing skills are increasingly critical to many areas of science (and everything else for that matter), this territoriality with respect to computer labs means that they proliferate across campus. The departments/colleges of Computer Science, Engineering, Social Sciences, Natural Resources and Biology [7] all end up creating and maintaining their own computer labs, and those labs end up sitting empty (or being used by students to send email) most of the time. This is horrifyingly inefficient in an era where funds for higher education are increasingly hard to come by and where technology turns over at an ever increasing rate.
Which brings me to the title of this post. The solution to this problem is for universities to stop allowing computer labs to be controlled by individual colleges/departments, in exactly the same way that most classrooms are not controlled by colleges/departments. Most universities have a central unit that schedules classrooms, and classes are fit into the available spaces. There is of course a highly justified bias toward putting classes in the buildings of the cognizant department, but large classes in particular may very well not be in the department’s building. It works this way because if it didn’t, the university would be wasting huge amounts of space having one or more lecture halls in every department, even if they were only needed a few hours a week. The same issue applies to computer labs, only they are also packed full of expensive electronics.
So please universities, for the love of all that is good and right and simply fiscally sound in the world, start treating computer labs like what they are: really valuable and expensive classrooms.
[1] Think of a single scientist who keeps 10 expensive computers and only uses them a total of 1-2 months per year, but when he does, the 10 computers aren’t really enough, so he has to wait a long time to finish the analysis.
[2] And I think the point I’m about to make is generally true; at least it has been at several other universities I’ve worked at over the years.
[3] Or in some cases something more like “Frak you. You fraking biologists have no fraking right to teach anyone a fraking thing about fraking computers.” Needless to say, the individual in question wasn’t actually saying frak, but this is a family blog.
[4] As a result of a personal favor done for one administrator by another administrator.
[5] I know because I took advantage of this to hold my office hours in the computer lab following class.
[6] To be fair, it should be noted that this and other computer labs are often used by students for doing homework (along with other less educationally oriented activities) when classes are not using the rooms, but in this case the classroom was a small part of a much larger lab, and since I never witnessed the non-classroom portion of the lab being filled to capacity, the argument stands.
[7] etc., etc., etc.