As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:
- Data
  - Knowledge of SQL
  - Dealing with proper database format and structure
  - Finding data
  - Appropriate treatment of data
  - Understanding what good data are
- Statistics
  - Monte Carlo methods
  - Maximum likelihood methods
  - Power analysis
- Math
  - Higher-level calculus
  - Ability to derive analytical solutions for problems
- Programming
  - Able to write programs for analysis, not just simple statistics and simple graphs
  - Able to use version control
  - Fluent in at least one language; once you can program in one language, you should be able to pick up other languages without much effort
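To make the Monte Carlo item above concrete, here is a minimal sketch of a randomization test in Python. The site names and body-size numbers are made up for illustration; they are not from any real dataset.

```python
import random

def mc_mean_diff_test(a, b, n_iter=10000, seed=42):
    """Monte Carlo randomization test for a difference in means.

    Shuffles the pooled observations n_iter times and counts how often
    the shuffled difference in means is at least as extreme as the
    observed one. Returns an approximate two-sided p-value.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:len(a)]) / len(a) -
                   sum(pooled[len(a):]) / len(b))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter

# Hypothetical body-size samples (grams) from two sites
site_a = [4.1, 3.8, 4.5, 4.2, 4.9, 4.4]
site_b = [3.2, 3.0, 3.6, 3.1, 3.4, 3.3]
p = mc_mean_diff_test(site_a, site_b)
```

The logic is the classic permutation argument: if group labels don't matter, reshuffling them should produce differences as large as the observed one fairly often; when it almost never does, the observed difference is unlikely to be chance.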
Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas. However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills. Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.
Achieving proficiency in an area is best done by using it for a project that you are interested in. The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.
Each skill should be generalizable to other problems. For example, if you need to learn maximum likelihood for your project, you should come away understanding how to apply it to other questions; if you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
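A minimal sketch of that SQL transferability, using Python's built-in sqlite3 module. The `surveys` table and its columns here are hypothetical stand-ins, not any particular ecological database:

```python
import sqlite3

# Build a tiny throwaway database; the table and column names are
# made-up stand-ins for whatever survey database you are querying.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE surveys (species TEXT, year INTEGER, count INTEGER)")
con.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [("DM", 2001, 12), ("DM", 2002, 15), ("PP", 2001, 7)],
)

# The query pattern (aggregate per group, then order) is the part that
# generalizes: only the connection and the table/column names change
# when you point it at a different database.
rows = con.execute(
    """
    SELECT species, SUM(count) AS total
    FROM surveys
    GROUP BY species
    ORDER BY total DESC
    """
).fetchall()
# rows is a list of (species, total) tuples
```

Pointing the same GROUP BY/aggregate pattern at a different database requires swapping only the connection and the names; the query logic carries over unchanged.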
In graduate school:
Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data. Even if they cannot write SQL code, they need to understand what good and bad databases look like, develop a good sense for questionable data, and understand how known issues with a dataset could affect its appropriateness for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.
A student who is getting a terminal master’s and is planning on using pre-existing data should probably focus on the data skill (data skills are highly marketable, and understanding data prevents major mistakes). If the data are not coming from a central database like the BBS, where the quality of the data is known, additional time will be needed to compile the data, clean it, figure out whether it can be used responsibly, and fill holes in it.
Master’s students who want to go on for a Ph.D. should decide what questions interest them and pick a project that focuses on learning a skill that will give them a head start: more empirical (programming or stats), more theoretical (math), or more applied (math for developing models, stats for applying and evaluating pre-existing models, or programming for making tools for people to use).
Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions. Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill. Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.
Right now, as a Ph.D. student, I am specializing in data and programming. I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project. For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.
Slides and script from Ethan White’s Ignite talk on Big Data in Ecology, from Sandra Chung and Jacquelyn Gill’s excellent ESA 2013 session on Sharing Makes Science Better. Slides are also archived on figshare.
1. I’m here to talk to you about the use of big data in ecology and to help motivate a lot of the great tools and approaches that other folks will talk about later in the session.
2. The definition of big is of course relative, and so when we talk about big data in ecology we typically mean big relative to our standard approaches based on observations and experiments conducted by single investigators or small teams.
3. And for those of you who prefer a more precise definition, my friend Michael Weiser defines big data and ecoinformatics as involving anything that can’t be successfully opened in Microsoft Excel.
4. Data can be of unusually large size in two ways. It can be inherently large, like citizen science efforts such as Breeding Bird Survey, where large amounts of data are collected in a consistent manner.
5. Or it can be large because it’s composed of a large number of small datasets that are compiled from sources like Dryad, figshare, and Ecological Archives to form useful compilation datasets for analysis.
6. We have increasing amounts of both kinds of data in ecology as a result of both major data collection efforts and an increased emphasis on sharing data.
7-8. But what does this kind of data buy us? First, big data allows us to work at scales beyond those at which traditional approaches are typically feasible. This is critical because many of the most pressing issues in ecology, including climate change, biodiversity, and invasive species, operate at broad spatial and long temporal scales.
9-10. Second, big data allows us to answer questions in general ways, so that we get the answer today instead of waiting a decade to gradually compile enough results to reach consensus. We can do this by testing theories using large amounts of data from across ecosystems and taxonomic groups, so that we know that our results are general, and not specific to a single system (e.g., White et al. 2012).
11. This is the promise of big data in ecology, but realizing this potential is difficult because working with either truly big data or data compilations is inherently challenging, and we still lack sufficient data to answer many important questions.
12. This means that if we are going to take full advantage of big data in ecology, we need three things: training in computational methods for ecologists, tools to make it easier to work with existing data, and more data.
13. We need to train ecologists in the computational tools needed for working with big data, and there are an increasing number of efforts to do this including Software Carpentry (which I’m actively involved in) as well as training initiatives at many of the data and synthesis centers.
14. We need systems for storing, distributing, and searching data like DataONE, Dryad, and NEON’s data portal, as well as the standardized metadata and associated tools that make finding data to answer a particular research question easier.
15. We need crowd-sourced systems like the Ecological Data Wiki to allow us to work together on improving insufficient metadata and understanding what kinds of analyses are appropriate for different datasets and how to conduct them rigorously.
16. We need tools for quickly and easily accessing data like rOpenSci and the EcoData Retriever so that we can spend our time thinking and analyzing data rather than figuring out how to access it and restructure it.
17. We also need systems that help turn small data into big data compilations, whether it be through centralized standardized databases like GBIF or tools that pull data together from disparate sources like Map of Life.
18. And finally, we need to continue to share more and more data, and to share it in useful ways: with good formats, standardized metadata, and open licenses that make it easy to work with.
19. And so, what I would like to leave you with is that we live in an exciting time in ecology thanks to the generation of large amounts of data by citizen science projects, exciting federal efforts like NEON, and a shift in scientific culture towards sharing data openly.
20. If we can train ecologists to work with and combine existing tools in interesting ways, it will let us combine datasets spanning the surface of the globe and diversity of life to make meaningful predictions about ecological systems.
We have all bemoaned the increasing difficulty of keeping up with the growing body of literature. Many of us, myself included, have been relying increasingly on following only a subset of journals, but with the growing popularity of the large open-access journals I know I for one am increasingly likely to miss papers. The purpose of this post isn’t to give you a panacea for your problems (sadly, I don’t think one exists, though I have hopes that someone will come up with something viable in the future). The purpose of this post is to let you know about an interesting addition, or alternative (for the brave), to the frantic scanning of tables of contents or RSS feeds: Google Scholar.
Almost everyone at this point knows you can go to Google Scholar, search for key words, and get a list of papers. But did you also know that you can set up a Google Scholar profile with your published articles, and that Google can use that to find articles that might be of interest to you? How does it do that? I’ll have to quote Google’s blog, because it’s a little like voodoo to me (obviously this is Morgan writing this post, not Ethan): “We determine relevance using a statistical model that incorporates what your work is about, the citation graph between articles, the fact that interests can change over time, and the authors you work with and cite.” When you go to Google Scholar’s homepage (and you’re logged in as you), it’ll notify you if there are new articles on your suggested list. I actually have been pleasantly surprised by the articles it has identified for me, including some book chapters I would never have seen otherwise. For example, here are several things that sound really interesting to me:
MC Emmerson – Marine Biodiversity and Ecosystem Functioning: …, 2012 – books.google.com
A Potochnik, B McGill – Philosophy of Science, 2012 – JSTOR
D West, J BRUCE – International Journal of Modern Physics B, 2012 – World Scientific
It doesn’t just search published journal articles. For example there are preprints from arXiv and government reports on my list. I don’t know if this would work as well for the young graduate students/postdocs since it uses the citations in your existing papers and our junior colleagues might have less data for Google to work with. However, once you have a profile, you can also follow other people who have profiles, which means you get an email every time scholarly work gets added to their profile. Are you a huge Simon Levin groupie? You can follow him and every time a paper gets added to his profile, you can get an email alerting you about the new paper. I also use this to follow a bunch of interesting younger people because they often publish less frequently or in journals I don’t happen to follow and this way I don’t miss their stuff when my Google Reader hits 1000+ articles to be perused! You can also sign up for alerts when someone you follow has their work cited. (And you can set up alerts for when your own work gets cited as well).
As I said before, I don’t think Google Scholar is a one-stop literature monitoring solution (yet), but I find it useful for getting me out of my high-impact-factor monitoring rut. The only thing you need to do is set up your Google Scholar profile, and the only reason not to do that is if you’re worried it’ll give Google the edge when it finally becomes self-aware and renames itself Skynet (ha ha ha ha….hmmm).
There is an exciting postdoc opportunity for folks interested in quantitative approaches to studying evolution in Michael Gilchrist’s lab at the University of Tennessee. I knew Mike when we were both in New Mexico. He’s really sharp, a nice guy, and a very patient teacher. He taught me all about likelihood and numerical maximization and opened my mind to a whole new way of modeling biological systems. This will definitely be a great postdoc for the right person, especially since NIMBioS is at UTK as well. Here’s the ad:
Outstanding, motivated candidates are being sought for a post-doctoral position in the Gilchrist lab in the Department of Ecology & Evolutionary Biology at the University of Tennessee, Knoxville. The successful candidate will be supported by a three-year NSF grant whose goal is to develop, integrate, and test mathematical models of protein translation and sequence evolution using available genomic sequence and expression-level datasets. Publications directly related to this work include Gilchrist, M.A. 2007, Molec. Bio. & Evol. (http://www.tinyurl/gilchrist07a) and Shah, P. and M.A. Gilchrist 2011, PNAS (http://www.tinyurl/shahgilchrist11).
The laboratory’s emphasis is on using biologically motivated models to analyze complex, heterogeneous datasets in order to answer biological questions. The research associated with this position draws upon a wide range of scientific disciplines, including cellular biology, evolutionary theory, statistical physics, protein folding, differential equations, and probability. Consequently, the ideal candidate would have a Ph.D. in biology, mathematics, physics, computer science, engineering, or statistics, with a background and interest in at least one of the other areas.
The researcher will collaborate closely with the PIs (Drs. Michael Gilchrist and Russell Zaretzki) on this project but will potentially have time to collaborate with them on other research projects. In addition, the researcher will have opportunities to interact with other faculty members in the Division of Biology as well as researchers at the National Institute for Mathematical and Biological Synthesis (http://www.nimbios.org).
Review of applications begins immediately and will continue until the position is filled. To apply, please submit curriculum vitae including three references, a brief statement of research background and interests, and 1-3 relevant manuscripts to mikeg[at]utk[dot]edu.
Some time ago in academia we realized that it didn’t make sense for individual scientists or even entire departments to maintain their own high performance computing resources. Use of these resources by an individual is intensive but sporadic, and maintenance of the resources is expensive [1], so universities soon realized they were better off having centralized high performance computing centers: computing resources were available when needed, and the averaging effects of having large numbers of individuals using the same computers meant that the machines didn’t spend much time sitting idle. This was obviously a smart decision.
So, why haven’t universities been smart enough to centralize an even more valuable computational resource, their computer labs?
As any student of Software Carpentry will tell you, it is far more important to be able to program well than it is to have access to a really large high performance computing center. This means that the most important computational resource a university has is the classes that teach their students how to program, and the computer labs on which they rely.
At my university [2], all of the computer labs on campus are controlled by either individual departments or individual colleges. This means that if you want to teach a class in one of them, you can’t request it as a room through the normal scheduling process; you have to ask the cognizant university fiefdom for permission. This wouldn’t be a huge issue, except that in my experience the answer is typically a resounding no. And it’s not a “no, we’re really sorry, but the classroom is booked solid with our own classes,” it’s “no, that computer lab is ours, good luck” [3].
And this means that we end up wasting a lot of expensive university resources. For example, last year I taught in a computer lab “owned” by another college [4]. I taught in the second class slot of a four slot afternoon. In the slot before my class there was a class that used the room about four times during the semester (out of 48 class periods). There were no classes in the other two afternoon slots [5]. That means that classes were being taught in the lab only 27% of the time, or 2% of the time if I hadn’t been granted an exception to use the lab [6].
Since computing skills are increasingly critical to many areas of science (and everything else for that matter), this territoriality with respect to computer labs means that they proliferate across campus. The departments/colleges of Computer Science, Engineering, Social Sciences, Natural Resources, and Biology [7] all end up creating and maintaining their own computer labs, and those labs end up sitting empty (or being used by students to send email) most of the time. This is horrifyingly inefficient in an era where funds for higher education are increasingly hard to come by and where technology turns over at an ever increasing rate.
Which brings me to the title of this post. The solution to this problem is for universities to stop allowing computer labs to be controlled by individual colleges/departments, in exactly the same way that most classrooms are not controlled by colleges/departments. Most universities have a central unit that schedules classrooms, and classes are fit into the available spaces. There is of course a highly justified bias toward putting classes in the buildings of the cognizant department, but large classes in particular may very well not be in the department’s building. It works this way because if it didn’t, the university would be wasting huge amounts of space having one or more lecture halls in every department, even if they were only needed a few hours a week. The same issue applies to computer labs, only they are also packed full of expensive electronics. So please, universities, for the love of all that is good and right and simply fiscally sound in the world, start treating computer labs like what they are: really valuable and expensive classrooms.
[1] Think of a single scientist who keeps 10 expensive computers and only uses them a total of 1-2 months per year, but when he does, the 10 computers aren’t really enough, so he has to wait a long time to finish the analysis.
[2] And I think the point I’m about to make is generally true; at least it has been at several other universities I’ve worked at over the years.
[3] Or in some cases something more like “Frak you. You fraking biologists have no fraking right to teach anyone a fraking thing about fraking computers.” Needless to say, the individual in question wasn’t actually saying frak, but this is a family blog.
[4] As a result of a personal favor done for one administrator by another administrator.
[5] I know because I took advantage of this to hold my office hours in the computer lab following class.
[6] To be fair, it should be noted that this and other computer labs are often used by students for doing homework (along with other less educationally oriented activities) when classes are not using the rooms; but in this case the classroom was a small part of a much larger lab, and since I never witnessed the non-classroom portion of the lab being filled to capacity, the argument stands.
[7] Etc., etc., etc.
There is a new postdoctoral research position available in Jim Brown’s lab at the University of New Mexico to study some of the major patterns of biodiversity. We know a bit about the research and it’s going to be an awesome project with a bunch of incredibly bright people involved. Jim’s lab is also one of the most intellectually stimulating and supportive environments that you could possibly work in. Seriously, if you are even remotely qualified then you should apply for this position. We’re both thinking about applying and we already have faculty positions :). Here’s the full ad:
The Department of Biology at the University of New Mexico is seeking applications for a post-doc position in ecology/biodiversity. The post-doc will be expected to play a major role in a multi-investigator, multi-institutional project supported by a four-year NSF Macrosystems Ecology grant. The research will focus on metabolic processes underlying the major patterns of biodiversity, especially their pervasive temperature dependence, and requires a demonstrated working knowledge of theory, mathematical and computer
Applicants must have a Ph.D. in ecology or a related discipline.
Review begins with the first applications and continues until the position is filled. Applicants must submit a cover letter and a curriculum vitae along with at least three phone numbers of references, three letters of recommendation and PDF’s of relevant preprints and publications to be sent directly to email@example.com attn: James Brown. Application materials must be received by July 25, 2011, for best consideration.
Questions related to this posting may be directed to Dr. James Brown at firstname.lastname@example.org or to Katherine Thannisch at email@example.com.
The University of New Mexico is an Equal Opportunity/Affirmative Action Employer and Educator. Women and underrepresented minorities are encouraged to apply.
There is an excellent post on open science, prestige economies, and the social web over at Marciovm’s posterous*. For those of you who aren’t insanely nerdy**, GitHub is… well… let’s just call it a very impressive collaborative tool for developing and sharing software***. But don’t worry, you don’t need to spend your days tied to a computer or have any interest in writing your own software to enjoy gems like:
Evangelists for Open Science should focus on promoting new, post-publication prestige metrics that will properly incentivize scientists to focus on the utility of their work, which will allow them to start worrying less about publishing in the right journals.
*A blog I’d never heard of before, but I subscribed to its RSS feed before I’d even finished the entire post.
**As far as biologists go. And, yes, when I say “insanely nerdy” I do mean it as a compliment.