Jabberwocky Ecology

Frequency distributions for ecologists III: Fitting model parameters

Summary

Don’t bin your data and fit a regression. Don’t use the CDF and fit a regression. Use maximum likelihood or other statistically grounded approaches that can typically be looked up on Wikipedia.

A bit more detail

OK, so you’ve visualized your data and after playing around a bit you have an idea of what the basic functional form of the model is. Now you want to estimate the parameters. So, for example, you’ve log-transformed the axes and your distribution is approximately a straight line, so you think it’s a power-law and you want to estimate the exponent of the power-law (more about figuring out whether this is actually the best model for your data in the next and final installment).

Ecologists typically fall back on what they know from an introductory statistics class or two to solve this problem – regression. They bin the data, count how many points occur in each binned set of x values, use those counts as the y data and the bin centers as the x data, and then fit a regression to these points to estimate the parameters. You can’t blame us as a group for this approach because we typically haven’t received training in the proper methods and we’re just doing our best. However, this approach is not valid and can yield very poor parameter estimates in some cases.

The best approach actually varies a bit depending on the specific form of the function, but generally what you want to do is maximum likelihood estimation (MLE). This approach determines the values of the free parameters that are most likely (i.e., have the greatest probability) to have produced the observed data. Technically what you really want is the minimum variance unbiased estimator (MVUE), which will often be a slightly modified form of the maximum likelihood estimate. The bad news is that most ecologists won’t want to derive the MLE themselves. The good news is that equations for calculating the MLE (or the MVUE) are readily available, and once you have the equation this approach is actually far less work than binning and fitting lines to data. For most common distributions you can just google the name of the distribution. The first link will typically be the Wikipedia entry. Just go to the Wikipedia page, scroll down to Parameter estimation and use the equation provided (and yes, Wikipedia is quite reliable for this kind of information). Even if the best approach isn’t maximum likelihood, this will typically provide you with the best approach (or at least a very good one). If you’re doing something more complicated there is an excellent series of reference books by Norman Johnson and colleagues that covers parameter estimation for most characterized distributions in detail. Some of these solutions will require the use of numerical methods. This is fairly straightforward in R and Matlab, but if that’s a bit much you should be able to find a statistician or quantitatively inclined ecological colleague to help (they’ll be impressed that you got as far as you did).
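Since the power-law case comes up so often in ecology, here is a minimal sketch in R of what this looks like in practice. It uses the standard maximum likelihood estimator for the exponent of a continuous power-law, alpha = 1 + n / sum(ln(x_i / x_min)) (the form given in Newman 2005 and Clauset et al. 2009); the function name and the simulated data are just for illustration.

    # Minimal sketch: MLE for the exponent of a continuous power-law.
    # Assumes the values in x follow a power-law above some threshold xmin.
    powerlaw_mle <- function(x, xmin = min(x)) {
      x <- x[x >= xmin]
      n <- length(x)
      alpha <- 1 + n / sum(log(x / xmin))   # maximum likelihood estimate
      se <- (alpha - 1) / sqrt(n)           # approximate standard error
      list(alpha = alpha, se = se, n = n)
    }

    # Example with simulated power-law data (true exponent 2.5, xmin = 1)
    x <- (1 - runif(1000))^(-1 / (2.5 - 1))
    powerlaw_mle(x, xmin = 1)

Note that the whole fit is a couple of lines of arithmetic, which is really the point: once you have the equation, this is far less work than binning the data and fitting a line.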

Recommended reading (focused on power-laws because that’s what I’m most familiar with)

Newman et al. 2005, Edwards et al. 2007, White et al. 2008, Clauset et al. 2009

Frequency distributions for ecologists IIb: CDFs and Kernel density estimates

Beyond simple histograms there are two basic methods for visualizing frequency distributions.

Kernel density estimation is basically a generalization of the idea behind histograms. The basic idea is to put a miniature distribution (e.g., a normal distribution) at the position of each individual data point and then add up those distributions to get an estimate of the frequency distribution. This is a well-developed field with a number of advanced methods for estimating the true form of the underlying frequency distribution. Silverman’s 1981 book is an excellent starting point for those looking for details. I like KDE as a general approach. It has the same issue as binning in that, depending on the kernel width (equivalent to bin width), you can get different impressions of the data, but it avoids the issue of having to choose the positions of bin edges. It also estimates a continuous distribution, which is what we are typically looking for.
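As a rough illustration of what this looks like in practice, here is a minimal sketch using R’s built-in density() function; the simulated data and the choice of bandwidth selector are just stand-ins for your own data and preferences.

    # Minimal sketch: kernel density estimate with a Gaussian kernel.
    # The bandwidth plays the same role as bin width in a histogram,
    # so it is worth looking at the estimate under a few different values.
    x <- rlnorm(500, meanlog = 2, sdlog = 1)   # made-up skewed data
    kde <- density(x, bw = "SJ")               # Sheather-Jones bandwidth selector
    plot(kde, main = "Kernel density estimate", xlab = "x")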

Cumulative distribution functions (CDFs for short) characterize the probability that a randomly drawn value of the x-variable will be less than or equal to a certain value of x. The CDF is conveniently the integral of the probability density function, so if you understand the CDF then simply taking its derivative will give you the frequency distribution. The CDF is nice because there are no choices that need to be made to construct a CDF for empirical data. No bin widths, no kernel widths, nothing. All you have to do is rank the n observed values of x from smallest to largest (i = 1 . . . n). The probability that an observation is less than or equal to x (the CDF) is then estimated as i/n. If all of this seems a bit esoteric, just think about the rank-abundance distribution. This is basically just a CDF of the abundance distribution with the axes flipped (and one of them rescaled). Because of the objectivity of this approach it has recently been suggested that this is the best approach to visualizing species-abundance distributions. I’ve spent a fair bit of time working with CDFs for exactly this reason. The problem that I have run into with this approach is that, except in certain straightforward situations, it can be difficult or counterintuitive to grasp what a given CDF translates into with respect to the frequency distribution of interest.
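For the record, here is a minimal sketch in R of the construction just described; the simulated data are only a stand-in for your own observations.

    # Minimal sketch: empirical CDF built exactly as described above.
    # Rank the n observed values from smallest to largest; the estimated
    # probability that an observation is <= x[i] is then i/n.
    x <- sort(rlnorm(500, meanlog = 2, sdlog = 1))   # made-up observations
    n <- length(x)
    cdf <- (1:n) / n
    plot(x, cdf, type = "s", xlab = "x", ylab = "P(X <= x)")

    # R's built-in ecdf() constructs the same estimate:
    plot(ecdf(x), main = "Empirical CDF")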


The take home message for visualization is that any of the available approaches done well and properly understood is perfectly satisfactory for visualizing frequency distribution data.

Frequency distributions for ecologists IIa: Data Visualization: Histograms

Well, I guess that grant season was a bit of an optimistic time to try to do a 4 part series on frequency distributions, but I’ve got a few minutes before heading off to an all day child birth class so I thought I’d see if I could squeeze in part 2.

OK, so you have some data and you’d like to get a rough visual idea of its frequency distribution. What do you do now? There are 3 basic approaches that I’ve seen used:

  1. Histograms. This is certainly the simplest and easiest-to-understand approach, and most of the time it is perfectly acceptable for visualizing frequency distributions. A histogram simply divides the range of possible values for your data into bins, counts the number of values in each bin and plots this count on the y-axis against the center value of the bin on the x-axis. Any statistics program will be able to do this, or you can easily do it yourself. If all of the bins are of equal width (as is the default in stats packages) then you’re basically done. If you want to convert the y-axis into the probability that a value falls in a bin, just divide the counts by the total number of data points. If you want to convert it to a proper probability density estimate then you’ll also want to divide this number by the width of the bin (i.e., the upper edge of the bin minus the lower edge of the bin). If the bins are not of equal width (which includes cases where you have transformed the data in some way) you should divide by the linear width of the bin regardless of whether you are concerned about turning your y-axis into a probability density estimate or not. This is to make sure that you are visualizing the distribution in the way you are thinking about it (most people are thinking about the distribution of x). Of course there are good reasons for wanting to visualize the distributions of transformed data. Just make sure you have one if you’re not going to divide by the linear width of the bin (a short code sketch follows below).
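For anyone who wants to see the count-to-density conversion spelled out, here is a minimal sketch in R; the simulated data are just for illustration, and R’s hist() does the same division internally when freq = FALSE.

    # Minimal sketch: histogram as a probability density estimate.
    x <- rlnorm(500, meanlog = 2, sdlog = 1)   # made-up skewed data
    h <- hist(x, breaks = 30, freq = FALSE,
              xlab = "x", main = "Histogram as a density estimate")

    # Same conversion done by hand: counts / (total count * linear bin width);
    # diff(h$breaks) gives each bin's width, so this also handles unequal bins.
    dens <- h$counts / (sum(h$counts) * diff(h$breaks))
    all.equal(dens, h$density)                 # should be TRUE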

Well, I’m out of time so I’ll go ahead and post this and come back with the other two options for visualization later.

Frequency distributions for ecologists I: Introduction

Dealing with frequency distribution data is something that we as ecologists haven’t typically done in a very sophisticated way. This isn’t really our fault. Proper methods aren’t typically taught in undergraduate statistics courses or in the graduate level classes targeted at biologists. That said, as ecology becomes a more quantitative science it becomes increasingly important to analyze data carefully so that we can understand its precise quantitative structure and its relationship to theoretical predictions.

Frequency distribution data is basically any data that you would think about making a histogram out of. Any time you have a single value that you (or someone else) have measured, for example the size or abundance of a species, and you are interested in how the number of occurrences changes as a function of that value – for example, are there more small species than large species, or more small patches than large patches – then you are talking about a frequency distribution. Technically what we’re often interested in is the probability distribution underlying the data, and you will often have more luck using this term when looking for information. Many major ecological patterns are probability/frequency distributions, including the species-abundance distribution, the species size distribution (also known as the body size distribution), the individual size distribution (also known as the size spectrum), Lévy flights, and many others.

Last year I wrote a paper with Jessica Green and Brian Enquist on one of the problems that can result from the approaches to this kind of data typically employed by ecologists, and on the more sophisticated methods available for addressing the question. As a result I’ve been receiving a fair bit of email recently about related problems; enough that I thought it might be worth a couple of posts to lay out some of the basic ideas regarding the analysis of frequency distribution data. Over the next week or so I’ll try to cover what I’ve learned about basic data visualization, parameter estimation, and comparing the fits of different models to the data. Along the way I may have a couple of things to say about some recently published papers that have the potential to cause confusion with respect to these subjects.

Please keep in mind that I am not a professionally trained statistician and that this is not intended to be an authoritative treatment of the subject. I’m just hoping to provide folks with an entryway into thinking about what to do with this kind of data and I’ll try to point to useful references to help take you further if you’re interested.

Data scientists

Nathan over at Flowing Data just posted an interesting piece on the emergence of a new class of scientists whose work focuses on the manipulation, analysis and presentation of data. The take-home message is that in order to fully master the ability to understand and communicate patterns in large quantities of data, one needs to have some ability in:

  • Computer science – for acquiring, managing and manipulating data
  • Mathematics and Statistics – for mining and analyzing data
  • Graphic design and Interactive interface design – to present the results of analyses in an easy to understand manner and encourage interaction and additional analysis by less technical users

His point is that while one could get together a group of people (one with each of these skills) to undertake this kind of task, the challenges of cross-disciplinary collaboration can slow down progress (or even prevent it entirely). As such, there is a need for individuals who have at least some experience in several of these fields to help facilitate the process. I think this is a good model for this kind of work in ecology, though given the already extensive multidisciplinarity required in the field I view this role as one occupied by only a fairly small fraction of folks.

The other thing that I really liked about this post (and about Flowing Data’s broader message) is the focus on the end user. The goal is to make ideas and tools available to the broadest possible audience, and sometimes the more technical folks in the biological sciences seem to forget that their goal should be to make things easy to understand and simple for non-technical users to use. This is undoubtedly a challenging task, but one that we should work to accomplish whenever possible.

April fools for the statistically inclined

You can always count on Andrew Gelman for quality April Fools Day posts.

Ecological Samurai

Data is the sword of the 21st century, those who wield it well, the Samurai.

Jonathan Rosenberg, SVP, Product Management, Google

Definitely not the meaning of “non-significant”

Andrew Gelman over at Statistical Modeling, Causal Inference, and Social Science posted a hilariously awful story about the interpretation of a non-significant result he saw at a recent talk (I particularly love the Grrrrrrr at the end).

I’m always yammering on about the difference between significant and non-significant, etc. But the other day I heard a talk where somebody made an even more basic error: He showed a pattern that was not statistically significantly different from zero and he said it was zero. I raised my hand and said something like: It’s not _really_ zero, right? The data you show are consistent with zero but they’re consistent with all sorts of other patterns too. He replied, no, it really is zero: look at the confidence interval.

Grrrrrrr.

This and related misinterpretations crop up all the time in ecology. I’ve witnessed some particularly problematic cases where the scientist is trying to determine whether some data are consistent with a theoretically predicted parameter and the confidence intervals are relatively wide. The CIs sometimes contain both 0 and the theoretically predicted value, and yet it is concluded that the data are not consistent with the model because the parameter is “not significant”. This is obviously problematic given that the goal of the analysis had nothing to do with demonstrating a difference from 0 in the first place.