The Weecology lab group run by Ethan White and Morgan Ernest at the University of Florida is seeking a Data Analyst to work collaboratively with faculty, graduate students, and postdocs to understand and model ecological systems. We’re looking for someone who enjoys tidying, managing, manipulating, visualizing, and analyzing data to help support scientific discovery.
The position will include:
- Organizing, analyzing, and visualizing large amounts of ecological data, including spatial and remotely sensed data. Modifying existing analytical approaches and data protocols as needed.
- Planning and executing the analysis of data related to newly forming questions from the group. Assisting in the statistical analysis of ecological data, as determined by the needs of the research group.
- Providing assistance and guidance to members of the research group on existing research projects. Working collaboratively with undergraduates, graduate students, and postdocs in the group and from related projects.
- Learning new analytical tools and software as needed.
This is a staff position in the group and will be focused on data management and analysis. All members of this collaborative group are considered equal partners in the scientific process and this position will be actively involved in collaborations. Weecology believes in the importance of open science, so most work done as part of this position will involve writing open source code, use of open source software, and production and use of open data.
Weecology is a partnership between the White Lab, which studies ecology using quantitative and computational approaches, and the Ernest Lab, which tends to be more field and community ecology oriented. The Weecology group supports and encourages members interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, as faculty at teaching-focused colleges, and as postdocs and faculty at research universities. We are also committed to supporting and training a diverse scientific workforce. Current and former group members encompass a variety of racial and ethnic backgrounds from the U.S. and other countries, members of the LGBTQ community, military veterans, people with chronic illnesses, and first-generation college students. More information about the Weecology group and respective labs is available on our website. You can also check us out on Twitter (@skmorgane, @ethanwhite, @weecology), GitHub, and our blog Jabberwocky Ecology.
The ideal candidate will have:
- Experience working with data in R or Python, some exposure to version control (preferably Git and GitHub), and potentially some background with database management systems (e.g., PostgreSQL, SQLite, MySQL) and spatial data.
- Research experience in ecology
- Interest in open approaches to science
- Experience collecting or working with ecological data
That said, don’t let the absence of any of these stop you from applying. If this sounds like a job you’d like to have, please go ahead and put in an application.
We currently have funding for this position for 2.5 years. Minimum salary is $40,000/year (which goes a pretty long way in Gainesville), but there is significant flexibility in this number for highly qualified candidates. We are open to the possibility of someone working remotely. The position will remain open until filled, with initial review of applications beginning on May 5th. If you’re interested in applying you can do so through the official UF position page. If you have any questions or just want to let us know that you’re applying you can email Weecology’s project manager Glenda Yenni at email@example.com.
The Weecology lab group run by Morgan Ernest and Ethan White at the University of Florida is seeking a post-doctoral researcher to study changes in ecological communities through time. This position will primarily involve broad-scale comparative analyses across communities using large time-series datasets and/or in-depth analyses of our own long-term dataset (the Portal Project). Experience with any of the following is useful, but not required: long-term data, macroecology, paleoecology, quantitative/theoretical ecology, and programming/data analysis in R or Python. The successful applicant will be expected to collaborate on lab projects on community dynamics and develop their own research projects in this area according to their interests.
Weecology is a partnership between the Ernest Lab, which tends to be more field and community ecology oriented and the White Lab, which tends to be more quantitative and computationally oriented. The Weecology group supports and encourages students interested in a variety of career paths. Former weecologists are currently employed in the tech industry, with the National Ecological Observatory Network, as faculty at teaching-focused colleges, and as postdocs and faculty at research universities. We are also committed to supporting and training a diverse scientific workforce. Current and former group members encompass a variety of racial and ethnic backgrounds from the U.S. and other countries, members of the LGBTQ community, military veterans, people with chronic illnesses, and first-generation college students. More information about the Weecology group and respective labs is available on our website. You can also check us out on Twitter (@skmorgane, @ethanwhite, @weecology), GitHub, and our blog Jabberwocky Ecology.
This 2-year postdoc has a flexible start date, but can start as early as June 1st 2017. Interested applicants should contact Dr. Morgan Ernest (firstname.lastname@example.org) with their CV, including a list of three references, a cover letter detailing their research interests/experiences, and one or more research samples (a PDF or link to a scientific product such as a published paper, preprint, software, data analysis code, etc.). The position will remain open until filled, with initial review of applications beginning on April 24th.
We are very excited to announce a major new release of the Data Retriever, our software for making it quick and easy to get clean, ready-to-analyze versions of publicly available data.
The Data Retriever automates the downloading, cleaning, and installing of ecological and environmental data into your choice of databases and flat file formats. Instead of spending hours tracking down the data on the web, downloading it, trying to import it, running into issues (e.g., non-standard nulls, problematic column names, encoding issues), fixing one problem, and then encountering the next, all you need to do is run a single command from the command line:
```sh
$ retriever install csv iris
$ retriever install sqlite breed-bird-survey -f bbs.sqlite
```
or from R:
```r
> rdataretriever::install('postgres', 'wine-quality')
> portal_data <- rdataretriever::fetch('portal')
```
The Data Retriever uses information in Frictionless Data datapackage.json files to automatically handle all of the complexities of “simple” data for you. For more complicated datasets, with dozens of components or major data structure issues, the Retriever uses Python scripts as plugins to handle the major data cleaning work and then automatically handles the rest.
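To give a flavor of what these files contain, here is a minimal sketch in the spirit of the Frictionless Data datapackage.json format. The field names and URL are illustrative assumptions rather than the Retriever’s exact schema, so consult the Retriever documentation before writing a real script:

```json
{
  "name": "iris",
  "title": "Iris flower measurements (illustrative entry)",
  "resources": [
    {
      "name": "iris",
      "url": "https://example.com/iris.csv",
      "schema": {
        "fields": [
          {"name": "sepal_length", "type": "number"},
          {"name": "species", "type": "string"}
        ]
      }
    }
  ]
}
```

Given a file like this, the Retriever can download the resource, apply the schema, and load the result into whichever backend you asked for.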
Expanded focus and name change
For those of you familiar with the EcoData Retriever, this is the same software with a new name. Challenges with the data end of the analysis pipeline occur across disciplines and our tools work just as well for non-ecological data, so we’ve started adding non-ecological data and changed our name to reflect that. We’d love to hear from anyone interested in leading a push to add data from another discipline or just interested in adding a single favorite dataset.
As part of this we’ve changed the name of the R package from ecoretriever to rdataretriever.
The 2.0 release includes a number of major changes including:
- Python 3 support (a single code base runs on both Python 2 and 3)
- Adoption of the Frictionless Data datapackage.json standard (replacing our old YAML-like metadata system), including a command line interface for creating and editing datapackage.json files
- Addition of JSON and XML as available output formats
- Major expansion of the documentation and hosting of the documentation at Read the Docs
- Removal of the graphical user interface (to allow us to focus that development time on wrappers for other languages)
- Lots of work under the hood and major improvements in testing
- Broadened scope to include non-ecological data
We are also in the process of releasing version 1.0 of the R package. This version adds the new features in the Data Retriever and also includes major stability improvements, in particular in RStudio and on Windows.
We also have a brand new website.
Upgrading to the new version (UPDATED)
To ensure the smoothest upgrade to the new version we recommend:
- Run `retriever reset scripts` from the command line
- Uninstall the old version of the EcoData Retriever
- Install the new version
- Run `retriever update` from the command line
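Taken together (with the uninstall and reinstall happening in between), the command-line portion of the upgrade is just:

```sh
$ retriever reset scripts
# ...uninstall the old EcoData Retriever, install the new Data Retriever...
$ retriever update
```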
Henry Senyondo is the lead developer for the Data Retriever and has done an amazing job over the past year developing new features and shoring up the fundamentals for the software. He led the work on 2.0 from start to finish.
Akash Goel was a Google Summer of Code student with the project last summer and was responsible for the majority of the work adding Python 3 support and switching the project over to the new datapackage.json standard.
Dan McGlinn, the creator of the R package, has continued his excellent leadership of the development of this package. Shawn Taylor, a new contributor, was instrumental in solving the stability issues on Windows/RStudio.
In addition to these core folks, our growing group of contributors to both projects has been invaluable for adding new functionality, fixing bugs, and testing new changes. We are super excited to have contributions from 30 different people and will keep working hard to make sure that everyone feels welcome and supported in contributing to the project.
The level of work done to get these releases out the door was only possible due to the generous support of the Gordon and Betty Moore Foundation’s Data Driven Discovery Initiative. This support allowed my group to employ Henry as a full-time software engineer to work on these and other projects. This kind of active support for the development and maintenance of research-oriented software makes sustainable software development at universities possible.
Last week Zack Brym and I formally announced a semester-long Data Carpentry course that we’ve been building over the last year. One of the things I’m most excited about in this effort is our attempt to support collaborative lesson development for university/college coursework.
I’ve experienced first-hand the potential for this sort of collaborative lesson development through the development of workshop lessons in Software Carpentry and Data Carpentry. Many of the workshop lessons developed by these two organizations now have 100+ contributors. As far as I’m aware, Software Carpentry was the first demonstration that large-scale open collaboration on lessons could work (but I’d love to hear of earlier examples if folks are aware of them), and it has resulted in what is widely regarded as really high quality lesson material. Having seen this work so effectively for workshops, I’m interested in seeing how well it can work for full-length courses.
Most college and university courses that I’m aware of start in one of three ways: 1) someone sits down and develops a course completely from scratch; 2) the course directly follows a textbook; or 3) a new professor inherits a course from the person who taught it previously and adapts it.
Developing a course from scratch, even one following a textbook fairly closely, is a huge time commitment. In contrast, with collaboratively developed courses new faculty, or faculty teaching new courses, wouldn’t need to start from scratch. They would be able to pick up an existing course to adapt and improve. I can’t even begin to describe how much easier this would have made my first few years as a faculty member. More generally, if we are teaching similar courses across dozens or hundreds of universities, it is much more efficient to share the effort of building and improving those courses than to have each person who teaches them do so independently.
In addition to the time and energy involved, there are often a lot of things that don’t work well the first time you teach a course, and it typically takes a few rounds of teaching to figure out what works best. One of the challenges of developing lessons in isolation is that you only teach a class every 1 or 2 years, which makes it hard and slow to figure out what needs work. In contrast, a collaboratively developed course might be taught dozens or hundreds of times each year, allowing the course to be improved much more rapidly through large-scale sampling and discussion of what works and what doesn’t. In addition to having more information, the fact that faculty are spending less time developing courses from scratch should leave them with more time for improving the materials. In combination this results in the potential for higher quality courses across institutions.
By involving large numbers of lesson developers, collaborative development also has the potential to help make courses more accurate, more up-to-date, and more approachable by novices. More lesson developers means a greater chance of having an expert on any particular topic involved, thus making the material more accurate and reducing the amount of bad practice/knowledge that gets taught. New faculty with more recent training on the development team can help keep both the material and the pedagogical practices up-to-date (this is hard when the same person teaches the same course for 20 years). More lesson developers also increases the likelihood that someone who isn’t an expert in any given piece of material is also involved, which should help make sure that the lesson avoids issues with expert blindness, thus making the material more accessible to students.
Collaborative college/university lesson development will not be without challenges. Collaborative lesson development in the style of Software and Data Carpentry requires proficiency with computational approaches not familiar to many academics, including version control, developing materials in Markdown, and working with static site generators like Jekyll. This means this approach is currently most accessible for those with some computational training and may initially work best for computing-focused courses. In addition, organizing open collaborations takes time and energy, as does collectively deciding how to design and update classes. Universities and colleges are not typically good at valuing time invested in non-traditional efforts, and that would need to change to help support those managing development of courses with large numbers of faculty involved. More substantial may be the fact that faculty are not used to collaborating with other people on course development and are therefore not used to compromising and negotiating what should go into a course. This can be compensated for to some degree by making courses easy to modify and customize, as we’ve tried to do with the Data Carpentry semester course, but ultimately there will still need to be a shift from prioritizing the personal desires of the faculty member to the best interests of the course more broadly. This approach will likely work best where there are a number of places that all want to teach the same general material.
Is it time?
When I built my first version of Programming for Biologists back in 2010 I was really excited about the potential for collaborative open course development. I built the course using Drupal, emailed a bunch of my friends who were teaching similar courses and said “Hey, we should work together on this stuff”, and stuck some welcoming language on the homepage. Nothing happened. A few years later I was on sabbatical at the University of North Carolina and got the opportunity to talk a fair bit with Elliot Hauser, who was part of a team trying to encourage this through a start-up called Coursefork. I was somewhat skeptical that this approach would work broadly at the time, but I thought it was really awesome that they were trying. They ended up pivoting to focus on helping computing education through a somewhat different route and became trinket. A couple of years later I converted my course to Jekyll on GitHub and told a lot of people about it. There was much excitement. Still nothing happened. So why might this work now? I think there are three things that increase the possibility of this becoming a bigger deal going forward. First, open source software development is becoming more frequent in academia. It still isn’t rewarded anywhere close to sufficiently, but the ethos of using and contributing to collaboratively developed tools is growing. Second, the technical tools that make this kind of collaboration easier are becoming more widely used and easier to learn through training efforts like Software Carpentry. Third, more and more people are actively developing university courses using these tools and making them available under open source licenses. Two of my favorites are Jenny Bryan’s Stat 545 and Karl Broman’s Tools for Reproducible Research. Our development of the Data Carpentry semester course has already benefited from using openly available materials like these and from feedback from members of the computational teaching community. I guess we’ll see what happens next.
This post benefited from a number of comments and suggestions by Zack Brym, who has also played a central and absolutely essential role in the development of the Data Carpentry semester-long course. The post also benefited from several conversations with Tracy Teal, the Executive Director of Data Carpentry, about the potential value of these approaches for college courses.
Over the last year and a half we have been actively developing a semester-long Data Carpentry course designed to be easily customized and integrated into existing graduate and undergraduate curricula.
Data Carpentry for Biologists contains course materials for teaching scientists how to work more effectively with data. The course provides introductions to data management and relational databases, data manipulation and analysis, and data visualization. It covers the same general types of material as a two-day Data Carpentry workshop, but expands the materials and opportunities for practice into a full-length university course. The teaching material uses R and SQLite, with some corresponding materials for Python as well. To help students understand the direct applications to their interests, the examples and exercises focus on biological questions and working with real data. The course emphasizes using best practices to produce reusable and reproducible data analysis.
Active-learning Teaching Materials
Learning computing requires active practice working through programming problems. Diving straight into computing is challenging for most scientists, so the course instruction is designed to combine short live-coding introductions to concepts followed immediately by students working on a related exercise. Additional exercises are assigned later for practice. This follows the “I do”, “We do”, “You do” approach to teaching, which leverages the benefits of active learning and flipped classrooms without leaving students who are less comfortable with the material feeling lost. The bulk of class time is spent working on assigned exercises, with the instructor moving around the room helping guide students through things they don’t understand and engaging with students who are thinking about advanced applications of what they’ve learned.
This approach is the result of lots of reading about effective teaching methods and Ethan’s experience teaching this and related courses over the last six years at Utah State University and the University of Florida. It seems to work well both for students who get the material easily and for those who find it more challenging. We’ve also tried to make these materials as useful as possible for self-guided students.
Open course development
Software Carpentry and Data Carpentry have shown how powerful collaborative lesson development can be, and we’re interested in bringing that to the university classroom. We have designed the course materials to be modular and easy to modify, and the course website easy to clone and set up. All of the teaching materials and associated website files are openly available at the Data Carpentry for Biologists repository on GitHub under CC-BY and MIT licenses. The course materials are all written in Markdown and everything runs on Jekyll through GitHub Pages. Making your own version of the course should take less than an hour. We’ve developed documentation for how to create your own version of the course and how to contribute to development. Exercises and assignments are modular, and changing exercises and assignments simply involves reordering items in a list. Adding a new exercise involves creating a new Markdown file and then adding its title to the list of exercises for an assignment (see the sketch below).
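As a purely illustrative sketch (the file name and layout here are hypothetical, not the repository’s actual structure), a Jekyll-based assignment definition might look something like:

```yaml
# _data/assignment-04.yml -- hypothetical assignment definition
title: Data manipulation in R
exercises:
  - portal-species-counts
  - cleaning-survey-dates
  - my-new-exercise   # added by creating my-new-exercise.md and listing it here
```

Reordering the items under `exercises` changes the order students see them; removing a line drops that exercise from the assignment.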
If you teach, or want to teach, a course like this, we’d love to get you involved. Here are some useful links for getting started.
We want to be sure getting involved is as easy as possible. We’ve worked hard to provide documentation and help resources for students and instructors. Students can find all they need to know at our student start guide. Instructors have access to course content and site design documentation.
Development of this course was generously supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and by the National Science Foundation as part of a CAREER award to Ethan White.
We are very excited to announce the newest release of the EcoData Retriever, our software for automating the downloading, cleaning, and installing of ecological and environmental data. Instead of spending hours or days trying to get complicated datasets like the Breeding Bird Survey ready for analysis, the Retriever lets you simply click a button or run a single command from R or the command line, and your computer does the rest.
It’s been over a year since the last retriever release and there are lots of new features and improvements to be excited about.
- We’ve added 21 new datasets, including major ecological and environmental datasets like eBird, VertNet, the Global Wood Density Database, and the PRISM climate data.
- To support all of these datasets we’ve added support for additional data types, including larger-than-memory archive files, and we’ve also improved the ability to control where downloaded files are stored and how they are clustered together.
- We’ve significantly improved documentation and now have a new automatically built documentation site at Read The Docs.
- We’ve also made a lot of under-the-hood improvements.
This is also the first release that has been overseen by Weecology’s new software engineer, Henry Senyondo. We’re excited to have Henry on the team, and now that he’s around development of both the EcoData Retriever and other lab software projects will be happening more quickly.
A big thanks to the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative for funding this development through Grant GBMF4563 and to the National Science Foundation for funding as part of a CAREER award to Ethan White.
As a reminder of how simple this all makes things, here’s loading the Breeding Bird Survey data from R in a single line:

```r
data <- ecoretriever::fetch("BBS")
```
I’m looking for one or more graduate students to join my group next fall. In addition to the official ad (below) I’d like to add a few extra thoughts. As Morgan Ernest noted in her recent ad, we have a relatively unique setup at Weecology in that we interact actively with members of the Ernest Lab. We share space, have joint lab meetings, and generally maintain a very close intellectual relationship. We do this with the goal of breaking down the barriers between the quantitative side of ecology and the field/lab side of ecology. Our goal is to train scientists who span these barriers in a way that allows them to tackle interesting and important questions.
I also believe it’s important to train students for multiple potential career paths. Members of my lab have gone on to faculty positions, postdocs, and jobs in both science non-profits and the software industry.
Scientists in my group regularly both write papers (e.g., these recent papers from dissertation chapters: Locey & White 2013, Xiao et al. 2014) and develop or contribute to software (e.g., EcoData Retriever, ecoretriever, rpartitions & pypartitions) even if they’ve never coded before they joined my lab.
My group generally works on problems at the population, community, and ecosystem levels of ecology. You can find out more about what we’ve been up to by checking out our website. If you’re interested in learning more about where the lab is headed I recommend reading my recently funded Moore Investigator in Data-Driven Discovery proposal.
PH.D. STUDENT OPENINGS IN QUANTITATIVE, COMPUTATIONAL, AND MACROECOLOGY
The White Lab at the University of Florida has openings for one or more PhD students in quantitative, computational, and/or macroecology to start fall 2015. The student(s) will be supported as graduate research assistants from a combination of NSF, Moore Foundation, and University of Florida sources depending on their research interests.
The White Lab uses computational, mathematical, and advanced statistical/machine learning methods to understand and make predictions/forecasts for ecological systems using large amounts of data. Background in quantitative and computational techniques is not necessary, only an interest in learning and applying them. Students are encouraged to develop their own research projects related to their interests.
The White Lab is currently at Utah State University, but is moving to the Department of Wildlife Ecology and Conservation at the University of Florida starting summer 2015.
Interested students should contact Dr. Ethan White (email@example.com) by Nov 15th, 2014 with their CV, GRE scores, and a brief statement of research interests.
UPDATE: Added a note that we work at population, community, and ecosystem levels.
We are very excited to announce the newest release of our EcoData Retriever software and the first release of a supporting R package, ecoretriever. If you’re not familiar with the EcoData Retriever you can read more here.
The biggest improvement to the Retriever in this set of releases is the ability to run it directly from R. Dan McGlinn did a great job leading the development of this package, and we got a ton of fantastic help from the folks at rOpenSci (most notably Scott Chamberlain, Gavin Simpson, and Karthik Ram). Now, once you install the main EcoData Retriever, you can run it from inside R by doing things like:
```r
install.packages('ecoretriever')
library(ecoretriever)

# List the datasets available via the Retriever
ecoretriever::datasets()

# Install the Gentry dataset into csv files in your working directory
ecoretriever::install('Gentry', 'csv')

# Download the raw Gentry dataset files, without any processing,
# to the subdirectory named data
ecoretriever::download('Gentry', './data/')

# Install and load a dataset as a list
Gentry <- ecoretriever::fetch('Gentry')
names(Gentry)
head(Gentry$counts)
```
The other big advance in this release is the ability to have the Retriever directly download files instead of processing them. This allows us to support data that doesn’t come in standard tabular forms. So, we can now include things like environmental data in GIS formats and phylogenetic data such as supertrees. We’ve used this new capability to allow the automatic downloading of the Bioclim data, one of the most widely used climate datasets in ecology, and the supertree for mammals from Fritz et al. 2009.
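Mirroring the download call shown above (the exact dataset short name here is an assumption; check `ecoretriever::datasets()` for the real one), grabbing the raw Bioclim files might look like:

```r
# Download the raw Bioclim climate layers, without processing,
# into a local data directory
ecoretriever::download('Bioclim', './data/')
```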
EcoData Retriever: quickly download and cleanup ecological data so you can get back to doing science
If you’ve ever worked with scientific data, your own or someone else’s, you know that you can end up spending a lot of time just cleaning up the data and getting it into a state that makes it ready for analysis. This involves everything from cleaning up non-standard null values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with them. Doing this for one dataset can be a lot of work, and if you work with a number of different databases like I do, the time and energy can really take away from the time you have to actually do science.
Over the last few years Ben Morris and I have been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With a click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary), and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.
Just click on the box to get the data:
Or run a command like this from the command line:
```sh
$ retriever install msaccess BBS --file myaccessdb.accdb
```
This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up to date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.
The Retriever handles things like:
- Creating the underlying database structures
- Automatically determining delimiters and data types
- Downloading the data (and if there are over 100 data files that can be a lot of clicks)
- Transforming data into standard structures so that common tools in R and Python and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
- Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones (see the sketch after this list)
- Combining multiple data files into single tables
- Placing all related tables in a single database or schema
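To make the null-value step concrete, here is a conceptual sketch in R of the kind of cleanup the Retriever automates for you. This is not the Retriever’s internal code, and the file name and sentinel values are illustrative:

```r
# Map common sentinel "null" codes to real NAs at load time,
# instead of hand-fixing them after the fact.
raw <- read.csv("mydata.csv",
                na.strings = c("NA", "NoData", "-999", "999.0"))
```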
The EcoData Retriever currently includes a number of large, openly available, ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables a Retriever script can be as simple as:
```
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
```
We also have some exciting new features on the To Do list including:
- Automatically cleaning up the taxonomy using existing services
- Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
- Integration into R and Python
Let us know what you think we should work on next in the comments.
As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:
Data:
- Knowledge of SQL
- Dealing with proper database format and structure
- Finding data
- Appropriate treatments of data
- Understanding what good data are

Statistics:
- Monte Carlo methods
- Maximum likelihood methods
- Power analysis

Math:
- Higher-level calculus
- Should be able to derive analytical solutions for problems

Programming:
- Should be able to write programs for analysis, not just simple statistics and simple graphs
- Able to use version control
- Once you can program in one language, you should be able to program in other languages without much effort, but should be fluent in at least one language
Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas. However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills. Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.
Achieving proficiency in an area is best done by using it for a project that you are interested in. The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.
The attribute should be generalizable to other problems: For example, if you need to learn maximum likelihood for your project, you should understand how to apply it to other questions. If you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
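For example (the table and column names here are hypothetical), the same query pattern carries over directly between databases:

```sql
-- Count records per species in one database...
SELECT species_id, COUNT(*) AS n
FROM bbs_counts
GROUP BY species_id;

-- ...and the structurally identical query in another.
SELECT species_id, COUNT(*) AS n
FROM portal_surveys
GROUP BY species_id;
```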
In graduate school:
Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data; even if they cannot write SQL code, they need to understand what good and bad databases look like, develop a good sense for questionable data, and understand how known issues with data could affect its appropriateness for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.
A student who is getting a terminal master’s and is planning on using pre-existing data should probably focus on the data skill (because data is a highly marketable skill, and understanding data prevents major mistakes). If the data are not coming from a central database, like the BBS, where the quality of the data is known, additional time will be needed to compile the data, clean it, figure out whether it can be used responsibly, and fill holes in it.
Master’s students who want to go on for a Ph.D. should decide what questions they are interested in and should try to pick a project that focuses on learning a good skill that will give them a head start: more empirical (programming or stats), more theoretical (math), or more applied (math (e.g., for developing models), stats (e.g., applying pre-existing models and evaluating models), or programming (e.g., making tools for people to use)).
Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions. Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill. Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.
Right now, as a Ph.D. student, I am specializing in data and programming. I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project. For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.
Slides and script from Ethan White’s Ignite talk on Big Data in Ecology, from Sandra Chung and Jacquelyn Gill’s excellent ESA 2013 session on Sharing Makes Science Better. Slides are also archived on figshare.
1. I’m here to talk to you about the use of big data in ecology and to help motivate a lot of the great tools and approaches that other folks will talk about later in the session.
2. The definition of big is of course relative, and so when we talk about big data in ecology we typically mean big relative to our standard approaches based on observations and experiments conducted by single investigators or small teams.
3. And for those of you who prefer a more precise definition, my friend Michael Weiser defines big data and ecoinformatics as involving anything that can’t be successfully opened in Microsoft Excel.
4. Data can be of unusually large size in two ways. It can be inherently large, like citizen science efforts such as Breeding Bird Survey, where large amounts of data are collected in a consistent manner.
5. Or it can be large because it’s composed of a large number of small datasets that are compiled from sources like Dryad, figshare, and Ecological Archives to form useful compilation datasets for analysis.
6. We have increasing amounts of both kinds of data in ecology as a result of both major data collection efforts and an increased emphasis on sharing data.
7-8. But what does this kind of data buy us? First, big data allows us to work at scales beyond those at which traditional approaches are typically feasible. This is critical because many of the most pressing issues in ecology, including climate change, biodiversity, and invasive species, operate at broad spatial and long temporal scales.
9-10. Second, big data allows us to answer questions in general ways, so that we get the answer today instead of waiting a decade to gradually compile enough results to reach consensus. We can do this by testing theories using large amounts of data from across ecosystems and taxonomic groups, so that we know that our results are general, and not specific to a single system (e.g., White et al. 2012).
11. This is the promise of big data in ecology, but realizing this potential is difficult because working with either truly big data or data compilations is inherently challenging, and we still lack sufficient data to answer many important questions.
12. This means that if we are going to take full advantage of big data in ecology we need three things: training in computational methods for ecologists, tools to make it easier to work with existing data, and more data.
13. We need to train ecologists in the computational tools needed for working with big data, and there are an increasing number of efforts to do this including Software Carpentry (which I’m actively involved in) as well as training initiatives at many of the data and synthesis centers.
14. We need systems for storing, distributing, and searching data like DataONE, Dryad, and NEON’s data portal, as well as the standardized metadata and associated tools that make finding data to answer a particular research question easier.
15. We need crowd-sourced systems like the Ecological Data Wiki to allow us to work together on improving insufficient metadata and understanding what kinds of analyses are appropriate for different datasets and how to conduct them rigorously.
16. We need tools for quickly and easily accessing data like rOpenSci and the EcoData Retriever so that we can spend our time thinking and analyzing data rather than figuring out how to access it and restructure it.
17. We also need systems that help turn small data into big data compilations, whether it be through centralized standardized databases like GBIF or tools that pull data together from disparate sources like Map of Life.
18. And finally, we need to continue to share more and more data, and to share it in useful ways: with good formats, standardized metadata, and open licenses that make it easy to work with.
19. And so, what I would like to leave you with is that we live in an exciting time in ecology thanks to the generation of large amounts of data by citizen science projects, exciting federal efforts like NEON, and a shift in scientific culture towards sharing data openly.
20. If we can train ecologists to work with and combine existing tools in interesting ways, it will let us combine datasets spanning the surface of the globe and diversity of life to make meaningful predictions about ecological systems.
We have all bemoaned the increasing difficulty of keeping up with the growing body of literature. Many of us, me included, have been relying increasingly on following only a subset of journals, but with the growing popularity of the large open-access journals I know I for one am increasingly likely to miss papers. The purpose of this post isn’t to give you the panacea to your problems (sadly I don’t think there is a panacea to this issue, though I have hopes that someone will come up with something viable in the future). The purpose of this post is to let you know about an interesting addition or alternative (for the brave) to the frantic scanning of the table of contents or RSS feeds: Google Scholar.
Almost everyone at this point knows you can go to Google Scholar and search for key words and it’ll produce a list of papers. But did you also know that you can set up a Google Scholar profile with your published articles and that Google can use that to find articles that might be of interest to you? How does it do that? I’ll have to quote Google’s blog because it’s a little like voodoo to me (obviously this is Morgan writing this post, not Ethan): “We determine relevance using a statistical model that incorporates what your work is about, the citation graph between articles, the fact that interests can change over time, and the authors you work with and cite.” When you go to Google Scholar’s homepage (and you’re logged in as you) it’ll notify you if there are new articles on your suggested list. I have actually been pleasantly surprised by the articles it has identified for me, including some book chapters I would never have seen otherwise. For example, here are several things that sound really interesting to me but that I would never have come across on my own:
- MC Emmerson – Marine Biodiversity and Ecosystem Functioning: …, 2012 – books.google.com
- A Potochnik, B McGill – Philosophy of Science, 2012 – JSTOR
- D West, J BRUCE – International Journal of Modern Physics B, 2012 – World Scientific
It doesn’t just search published journal articles. For example there are preprints from arXiv and government reports on my list. I don’t know if this would work as well for the young graduate students/postdocs since it uses the citations in your existing papers and our junior colleagues might have less data for Google to work with. However, once you have a profile, you can also follow other people who have profiles, which means you get an email every time scholarly work gets added to their profile. Are you a huge Simon Levin groupie? You can follow him and every time a paper gets added to his profile, you can get an email alerting you about the new paper. I also use this to follow a bunch of interesting younger people because they often publish less frequently or in journals I don’t happen to follow and this way I don’t miss their stuff when my Google Reader hits 1000+ articles to be perused! You can also sign up for alerts when someone you follow has their work cited. (And you can set up alerts for when your own work gets cited as well).
As I said before, I don’t think Google Scholar is a one-stop literature-monitoring solution (yet), but I find it useful for getting me out of my high-impact-factor monitoring rut. The only thing you need to do is set up your Google Scholar profile, and the only reason not to do that is if you’re worried it’ll give Google the edge when it finally becomes self-aware and renames itself Skynet (ha ha ha ha….hmmm).