The Ecological Database Toolkit
Large amounts of ecological and environmental data are becoming increasingly available due to initiatives sponsoring the collection of large-scale data and efforts to increase the publication of already collected datasets. As a result, ecology is entering an era where progress will be increasingly limited by the speed at which we can organize and analyze data. To help improve ecologists’ ability to quickly access and analyze data we have been developing software that designs database structures for ecological datasets and then downloads the data, processes it, and installs it into several major database management systems (at the moment we support Microsoft Access, MySQL, PostgreSQL, and SQLite). The database toolkit system can substantially reduce hurdles to scientists using new databases, and save time and reduce import errors for more experienced users.
The database toolkit can download and install small datasets in seconds and large datasets in minutes. Imagine being able to download and import the newest version of the Breeding Bird Survey of North America (a database with 4 major tables and over 5 million records in the main table) in less than five minutes. Instead of spending an afternoon setting up the newest version of the dataset and checking your import for errors you could spend that afternoon working on your research. This is possible right now and we are working on making this possible for as many major public/semi-public ecological databases as possible. The automation of this process reduces the time for a user to get most large datasets up and running by hours, and in some cases days. We hope that this will make it much more likely that scientists will use multiple datasets in their analyses; allowing them to gain more rapid insight into the generality of the pattern/process they are studying.
We need your help
We have done quite a bit of testing on this system, including building in automated tests based on manual imports of most of the currently available databases, but there are always bugs and imperfections in code that cannot be identified until the software is used in real world situations. That’s why we’re looking for folks to come try out the Database Toolkit and let us know what works and what doesn’t, what they’d like to see added or taken away, and if/when the system fails to work properly. So if you’ve got a few minutes to have half a dozen ecological databases automatically installed on your computer for you, stop by the Database Toolkit page at EcologicalData.org, give it a try, and let us know what you think.
We have a postdoc position available for someone interested in the general areas of macroecology, quantitative ecology, and ecoinformatics. Here’s the short ad with links to the full job description:
Ethan White’s lab at Utah State University is looking for a postdoc to collaborate on research studying approaches for unifying macroecological patterns (e.g., species abundance distributions and species-area relationships) and predicting variation in these patterns using ecological and environmental variables. The project aims to 1) evaluate the performance of models that link ecological patterns by using broad scale data on at least three major taxonomic groups (birds, plants, and mammals); and 2) combine models with ecological and environmental factors to explain continental scale variation in community structure. Models to be explored include maximum entropy models, neutral models, fractal based models, and statistical models. The postdoc will also be involved in an ecoinformatics initiative developing tools to facilitate the use of existing ecological data. There will be ample opportunity for independent and collaborative research in related areas of macroecology, community ecology, theoretical ecology, and ecoinformatics. The postdoc will benefit from interactions with researchers in Dr. White’s lab, the Weecology Interdisciplinary Research Group, and with Dr. John Harte’s lab at the University of California Berkeley. Applicants from a variety of backgrounds including ecology, mathematics, statistics, physics and computer science are encouraged to apply. The position is available for 1 year with the possibility for renewal depending on performance, and could begin as early as September 2010 and no later than May 2011. Applications will begin to be considered starting on September 1, 2010. Go to the USU job page to see the full advertisement and to apply.
If you’re interested in the position and are planning to be at ESA please leave a comment or drop me an email (email@example.com) and we can try to set up a time to talk while we’re in Pittsburgh. Questions about the position and expressions of interest are also welcome.
UPDATE: This position has been filled.
Frequency distributions for ecologists V: Don’t let the lack of a perfect tool prevent you from asking interesting questions
I had an interesting conversation with someone the other day that made me think I needed one last frequency distribution post, to avoid discouraging some people from moving forward with addressing interesting questions.
As a quantitative ecologist I spend a fair amount of time trying to figure out the best way to do things. In other words, I often want to know the best available method for answering a particular question. When I think I’ve figured this out I (sometimes, if I have the energy) try to communicate the best methodology more broadly to encourage good practice and accurate answers to questions of interest to ecologists. In some cases finding the best approach is fairly easy. For example, likelihood based methods for fitting and comparing simple frequency distributions are often straightforward and can be easily looked up online. However, in many cases the methodological challenges are more substantial, or the question being asked is not general enough that the methods have been worked out and clearly presented. This happens in the case of frequency distributions when one needs non-standard minimum and maximum values (a common case in ecological studies) or when one needs discrete analogs of traditionally continuous distributions. It’s not that these cases can’t be addressed, it’s just that you can’t look the solutions up on Wikipedia.
So, what is someone without a sufficient background to do (and, btw, that might be all of us if the problem is really hard or even… intractable)? First, I’d recommend asking for help. Talk to a statistician at your university or a quantitative colleague and see if they can help you figure things out. I am always pleased to try to help out because I always learn something in the process. Then, if that fails, just do something. Morgan and I will probably write more about this later, but please, please, please don’t let the questions you ask as an ecologist be defined by the availability of an ideal statistical methodology that is easy to implement. In the context of the current series of posts, if you are trying to do something with a more complex frequency distribution and you can’t find a solution to your problem using likelihood, then use something else. If it were me I’d go with either normalized logarithmic binning or something based on the CDF, as these methods can behave reasonably well. Sure, people like me may complain, but that’s fine. Just make clear that you are aware of the potential weaknesses and that you did what you did because you couldn’t figure out an appropriate alternative approach. That way you still get to make progress on the question of interest and you may motivate people to help work on developing better methods. Sure, you might not be presenting the “right” answer, but then I very much doubt that we ever are when studying ecological systems anyway.
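For anyone curious what normalized logarithmic binning actually involves, here is a minimal Python sketch (the function name and the simulated power-law data are just for illustration): bin edges are log-spaced, and each count is divided by the sample size and by the *linear* width of its bin so the result is a proper density estimate.

```python
import math
import random

def log_binned_density(data, n_bins=10):
    """Normalized logarithmic binning: log-spaced bin edges, with
    counts divided by the sample size and the linear bin width."""
    lo, hi = min(data), max(data)
    # log-spaced edges; nudge the top edge up so the max lands in the last bin
    edges = [lo * (hi / lo) ** (i / n_bins) for i in range(n_bins + 1)]
    edges[-1] *= 1.0000001
    counts = [0] * n_bins
    for x in data:
        for i in range(n_bins):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    n = len(data)
    centers = [math.sqrt(edges[i] * edges[i + 1]) for i in range(n_bins)]
    densities = [counts[i] / (n * (edges[i + 1] - edges[i]))
                 for i in range(n_bins)]
    return centers, densities

# Simulated power-law data (inverse transform sampling), for illustration only
random.seed(1)
alpha, xmin = 2.0, 1.0
data = [xmin * (1 - random.random()) ** (-1 / (alpha - 1)) for _ in range(5000)]
centers, dens = log_binned_density(data)
```

The geometric-mean bin centers and the division by linear width are the two details that are most often gotten wrong in practice.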
This is a table of contents of sorts for five posts on the visualization, fitting, and comparison of frequency distributions. The goal of these posts is to expose ecologists to the ideas and language related to good statistical practices for addressing frequency distribution data. The focus is on simple distributions and likelihood methods. The information provided here is far from comprehensive, but my aim is to give readers a good place to start when exploring this kind of data.
Likelihood, likelihood, likelihood (and maybe some other complicated approaches), but definitely not r^2 values from fitting regressions to binned data.
A bit more nitty gritty detail
In addition to causing issues with parameter estimation, binning based methods are also inappropriate when trying to determine which distribution provides the best fit to empirical data. As a result you won’t find any card-carrying statistician recommending this approach. Basically, binning and fitting regressions ignores the very nature of this kind of data, generating bizarre error structures and making measures of model fit arbitrary and ungrounded in statistical theory. This isn’t something that is controversial in any way. It is not “hotly contested” or open to debate despite what you may read in the ecological literature (i.e., Reynolds & Rhodes 2009), and you can’t (well, at least you shouldn’t) choose to use binning based methods just because someone else did (i.e., Maestre & Escudero 2009)*.
So, to be rigorous you want to use a more appropriate framework, which again should be likelihood (or Bayes; or something more complicated that I know nothing about; but if you’re taking the time to read this article you should probably start with likelihood). To determine the likelihood of a model given the data you simply take the product of the probability density function (pdf) evaluated at each value of x (put each value of x into the equation for the pdf with the parameters set to the maximum likelihood estimates and then multiply all of the resulting values together). Having done this you can use a likelihood ratio test to compare two distributions (if you’re into p-values this is for you) or you can use model selection based on an information criterion like AIC. With the likelihood in hand the AIC is then just 2k-2*ln(likelihood) (where k is the number of parameters). In practice you’ll probably want to calculate the ln(likelihood) from the start (otherwise the values get really small and you’ll run into precision issues), so you would typically take the sum of the log of the pdf instead of the product described above. Andy Edwards’ 2007 Nature paper does a nice job of talking about this in the context of Lévy flights. It’s worth keeping in mind that the details can have an important influence here, so you’ll want to be sure that your pdfs have appropriately defined minimum and maximum values and satisfy the other limitations on the parameters as well. This approach will yield valid AICs for comparing models. In contrast, the AIC values in another recent Nature paper (which are based on binning the data, fitting regressions, and then estimating the likelihoods of those regressions) are not grounded in probability in the same way and in my opinion are not appropriate (at least without some Monte Carlo work to show that they perform well).
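As a rough sketch of this workflow (in Python, on simulated exponential data, with hypothetical function names), here is the sum-of-log-pdf calculation and an AIC comparison between two candidate distributions:

```python
import math
import random

def loglik_exponential(data):
    """MLE for an exponential (lam = 1/mean); the log-likelihood is the
    sum of log pdf values (a sum, not a product, to avoid underflow)."""
    lam = 1.0 / (sum(data) / len(data))
    lnL = sum(math.log(lam) - lam * x for x in data)
    return lnL, 1  # log-likelihood, number of free parameters

def loglik_lognormal(data):
    """MLEs for a lognormal are the mean and sd of the logged data."""
    logs = [math.log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    lnL = sum(-math.log(x * sigma * math.sqrt(2 * math.pi))
              - (math.log(x) - mu) ** 2 / (2 * sigma ** 2) for x in data)
    return lnL, 2

def aic(lnL, k):
    """AIC = 2k - 2*ln(likelihood); lower is better."""
    return 2 * k - 2 * lnL

random.seed(42)
data = [random.expovariate(1.0) for _ in range(2000)]
aics = {name: aic(*fn(data)) for name, fn in
        [("exponential", loglik_exponential), ("lognormal", loglik_lognormal)]}
```

With the data simulated from an exponential, the exponential model should come out with the lower (better) AIC despite the lognormal’s extra parameter.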
I’m not trying to give anyone a hard time about what they’ve done in the past. There really is a failure of education and discussion regarding how to deal with distributions in ecology. That said, now that the discussion of these issues has started to reach the broader ecological community, we do need to be careful about unnecessarily and inappropriately fomenting a statistical controversy that doesn’t exist, so that we can move towards the use and refinement of the most rigorous methods available.
If you’re looking for a good introduction to this area I highly recommend The Ecological Detective by Hilborn & Mangel. If you’re looking for something with more advanced material and technical detail I like In All Likelihood by Pawitan. I’ve also heard good things about Benjamin Bolker’s new book, but I have not yet read it myself.
*NB: I haven’t conducted any Monte Carlo work on this myself like I have for parameter estimation, but I have read quite a bit of statistical literature in this area and if you do the same I think you will find that statisticians don’t even consider the possibility of binning and fitting regressions, because it is so obviously disconnected from the question at hand.
Don’t bin your data and fit a regression. Don’t use the CDF and fit a regression. Use maximum likelihood or other statistically grounded approaches, which can typically be looked up on Wikipedia.
A bit more detail
OK, so you’ve visualized your data and after playing around a bit you have an idea of what the basic functional form of the model is. Now you want to estimate the parameters. So, for example, you’ve log-transformed the axes and your distribution is approximately a straight line, so you think it’s a power-law and you want to estimate the exponent of the power-law (more about figuring out if this is actually the best model for your data in the next and final installment).
Ecologists typically fall back on what they know from an introductory statistics class or two to solve this problem – regression. They bin the data, count how many points occur for each binned set of x values, and use those counts for the y data and the bin centers for the x data. They then fit a regression to these points to estimate parameters. You can’t blame us as a group for this approach because we typically haven’t received training in the proper methods and we’re just doing our best. However, this approach is not valid and can yield very poor parameter estimates in some cases.
The best approach actually varies a bit depending on the specific form of the function, but generally what you want to do is maximum likelihood estimation (MLE). This approach determines the values of the free parameters that are most likely (i.e., have the greatest probability) of having produced the observed data. Technically what you really want is the minimum variance unbiased estimator (MVUE), which will often be a slightly modified form of the maximum likelihood estimate. The bad news is that most ecologists won’t want to derive the MLE themselves. The good news is that equations for calculating the MLE (or the MVUE) are readily available, and once you have the equation this approach is actually far less work than binning and fitting lines to data. For most common distributions you can just google the name of the distribution. The first link will typically be the Wikipedia entry. Just go to the Wikipedia page, scroll down to Parameter estimation, and use the equation provided (and yes, Wikipedia is quite reliable for this kind of information). Even if the best approach isn’t maximum likelihood, this will typically provide you with the best approach (or at least a very good one). If you’re doing something more complicated there is an excellent series of reference books by Norman Johnson and colleagues that covers parameter estimation for most characterized distributions in detail. Some of these solutions will require the use of numerical methods. This is fairly straightforward in R and Matlab, but if that’s a bit much you should be able to find a statistician or quantitatively inclined ecological colleague to help (they’ll be impressed that you got as far as you did).
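To continue the power-law example, the MLE for the exponent of a continuous power law with a known minimum value has a well-known closed form: alpha_hat = 1 + n / sum(ln(x_i / xmin)). A minimal Python sketch (the simulated data and function name are just for illustration):

```python
import math
import random

def powerlaw_mle(data, xmin):
    """Closed-form MLE for the exponent of a continuous power law,
    p(x) proportional to x^-alpha for x >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    return 1.0 + n / sum(math.log(x / xmin) for x in tail)

# Simulate power-law data with a known exponent via inverse transform sampling
random.seed(0)
alpha, xmin = 2.5, 1.0
data = [xmin * (1 - random.random()) ** (-1.0 / (alpha - 1))
        for _ in range(10000)]
alpha_hat = powerlaw_mle(data, xmin)
```

One line of arithmetic, versus choosing bins, counting, and fitting a regression – and the estimate recovers the true exponent closely.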
Recommended reading (focused on power-laws because that’s what I’m most familiar with)
Beyond simple histograms there are two basic methods for visualizing frequency distributions.
Kernel density estimation is basically a generalization of the idea behind histograms. The basic idea is to place a miniature distribution (e.g., a normal distribution) at the position of each individual data point and then add up those distributions to get an estimate of the frequency distribution. This is a well-developed field with a number of advanced methods for estimating the true form of the underlying frequency distribution. Silverman’s 1986 book is an excellent starting point for those looking for details. I like KDE as a general approach. It has the same issues as binning in that depending on the kernel width (equivalent to bin width) you can get different impressions of the data, but it avoids the issue of having to choose the position of bin edges. It also estimates a continuous distribution, which is what we are typically looking for.
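To make the “miniature distribution at each point” idea concrete, here is a minimal Gaussian-kernel sketch in Python (the data and bandwidth are arbitrary; in practice you would use a packaged KDE routine with a principled bandwidth choice):

```python
import math

def gaussian_kde(data, bandwidth):
    """Place a normal kernel at each observation and sum them to
    estimate the density at any point x."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return density

# Two small clusters of points; the KDE is high near them and low between
dens = gaussian_kde([1.0, 1.2, 1.1, 3.0, 3.1], bandwidth=0.3)
```

The bandwidth plays exactly the role of bin width in a histogram: widen it and the two clusters smear into one bump, narrow it and the estimate gets spiky.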
Cumulative distribution functions (CDFs for short) characterize the probability that a randomly drawn value of the x-variable will be less than or equal to a certain value of x. The CDF is conveniently the integral of the probability density function, so if you understand the CDF then simply taking its derivative will give you the frequency distribution. The CDF is nice because there are no choices that need to be made to construct a CDF for empirical data. No bin widths, no kernel widths, nothing. All you have to do is rank the n observed values of x from smallest to largest (i = 1 . . . n). The probability that an observation is less than or equal to x (the CDF) is then estimated as i/n. If all of this seems a bit esoteric, just think about the rank-abundance distribution. This is basically just a CDF of the abundance distribution with the axes flipped (and one of them rescaled). Because of the objectivity of this approach it has recently been suggested that this is the best approach to visualizing species-abundance distributions. I’ve spent a fair bit of time working with CDFs for exactly this reason. The problem that I have run into with this approach is that except in certain straightforward situations it can be difficult/counterintuitive to grasp what a given CDF translates into with respect to the frequency distribution of interest.
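Constructing the empirical CDF really is as simple as it sounds. A minimal Python sketch of the rank-and-divide recipe (the function name is just for illustration):

```python
def empirical_cdf(data):
    """Rank the n observed values from smallest to largest;
    P(X <= x_(i)) is then estimated as i/n."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

points = empirical_cdf([5, 1, 3, 2, 4])
```

No bin widths, no kernel widths – just sorting. Plotting the resulting (x, i/n) pairs gives the empirical CDF directly.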
The take home message for visualization is that any of the available approaches done well and properly understood is perfectly satisfactory for visualizing frequency distribution data.
Well, I guess that grant season was a bit of an optimistic time to try to do a 4 part series on frequency distributions, but I’ve got a few minutes before heading off to an all day child birth class so I thought I’d see if I could squeeze in part 2.
OK, so you have some data and you’d like to get a rough visual idea of its frequency distribution. What do you do now? There are 3 basic approaches that I’ve seen used:
- Histograms. This is certainly the simplest and easiest to understand approach and most of the time for visualizing frequency distributions it is perfectly acceptable. A histogram simply divides the range of possible values for your data into bins, counts the number of values in each bin, and plots this count on the y-axis against the center value of the bin on the x-axis. Any statistics program will be able to do this or you can easily do it yourself. If all of the bins are of equal width (as is the default in stats packages) then you’re basically done. If you want to convert the y-axis into the probability that a value falls in a bin, just divide the counts by the total number of data points. If you want to convert it to a proper probability density estimate then you’ll also want to divide this number by the width of the bin (i.e., the upper edge of the bin minus the lower edge of the bin). If the bins are not of equal width (which includes if you have transformed the data in some way) you should divide by the linear width of the bin regardless of whether you are concerned about turning your y-axis into a probability density estimate or not. This is to make sure that you are visualizing the distribution in the way you are thinking about it (most people are thinking about the distribution of x). Of course there are good reasons for wanting to visualize the distributions of transformed data. Just make sure you have one if you’re not going to divide by the linear width of the bin.
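To make the normalization step concrete, here is a minimal Python sketch (the bin edges and data are arbitrary, and the function name is just for illustration) that converts counts into a proper density estimate by dividing by both the sample size and the linear bin width:

```python
def histogram_density(data, edges):
    """Count values per bin, then divide each count by n and by the
    linear bin width to get a probability density estimate."""
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]
    return [c / (n * w) for c, w in zip(counts, widths)]

# Unequal-width bins: dividing by linear width makes the bars comparable
dens = histogram_density([0.1, 0.2, 0.3, 0.9, 1.5, 3.5], [0, 0.5, 1, 2, 4])
```

A quick sanity check on any density estimate built this way: the densities times the linear bin widths should sum to one.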
Well, I’m out of time so I’ll go ahead and post this and come back with the other two options for visualization later.
Dealing with frequency distribution data is something that we as ecologists haven’t typically done in a very sophisticated way. This isn’t really our fault. Proper methods aren’t typically taught in undergraduate statistics courses or in the graduate level classes targeted at biologists. That said, as ecology becomes a more quantitative science it becomes increasingly important to analyze data carefully so that we can understand its precise quantitative structure and its relationship to theoretical predictions.
Frequency distribution data is basically any data that you would think about making a histogram out of. Any time you have a single value that you (or someone else) has measured, for example the size or abundance of a species, and you are interested in how the number of occurrences changes as a function of that value, for example – are there more small species than large species or more small patches than large patches, then you are talking about a frequency distribution. Technically what we’re often interested in is the probability distribution underlying the data and you will often have more luck using this term when looking for information. Many major ecological patterns are probability/frequency distributions including the species-abundance distribution, species size distribution (also known as the body size distribution), individual size distribution (also known as the size spectrum), Levy flights, and many others.
Last year I wrote a paper with Jessica Green and Brian Enquist on one of the problems that can result from the approaches to this kind of data typically employed by ecologists and the more sophisticated methods available for addressing the question. As a result I’ve been receiving a fair bit of email recently about related problems; enough that I thought it might be worth a couple of posts to lay out some of the basic ideas regarding the analysis of frequency distribution data. Over the next week or so I’ll try to cover what I’ve learned about basic data visualization, parameter estimation, and comparing the fits of different models to the data. Along the way I may have a couple of things to say about some recently published papers that have the potential to cause confusion with respect to these subjects.
Please keep in mind that I am not a professionally trained statistician and that this is not intended to be an authoritative treatment of the subject. I’m just hoping to provide folks with an entryway into thinking about what to do with this kind of data and I’ll try to point to useful references to help take you further if you’re interested.
Nathan over at Flowing Data just posted an interesting piece on the emergence of a new class of scientists whose work focuses on the manipulation, analysis, and presentation of data. The take home message is that in order to fully master the ability to understand and communicate patterns in large quantities of data, one needs to have some ability in:
- Computer science – for acquiring, managing and manipulating data
- Mathematics and Statistics – for mining and analyzing data
- Graphic design and Interactive interface design – to present the results of analyses in an easy to understand manner and encourage interaction and additional analysis by less technical users
His point is that while one could get together a group of people (one with each of these skills) to undertake this kind of task, the challenges of cross-disciplinary collaboration can slow down progress (or even prevent it entirely). As such, there is a need for individuals who have at least some experience in several of these fields to help facilitate the process. I think this is a good model for this kind of work in ecology, though given the already extensive multidisciplinarity required in the field I view this role as one occupied by only a fairly small fraction of folks.
The other thing that I really liked about this post (and about Flowing Data’s broader message) is the focus on the end user. The goal is to make ideas and tools available to the broadest possible audience, and the more technical folks in the biological sciences often seem to forget that their goal should be to make things easy to understand and simple for non-technical users to use. This is undoubtedly a challenging task, but one that we should work to accomplish whenever possible.
Data is the sword of the 21st century, those who wield it well, the Samurai.
– Jonathan Rosenberg, SVP, Product Management, Google