Frequency distributions for ecologists IV: comparing model performance


Likelihood, likelihood, likelihood (and maybe some other complicated approaches), but definitely not r^2 values from fitting regressions to binned data.

A bit more nitty-gritty detail

In addition to causing issues with parameter estimation, binning-based methods are also inappropriate when trying to determine which distribution provides the best fit to empirical data. As a result you won’t find any card-carrying statistician recommending this approach. Basically, binning and fitting regressions ignores the very nature of this kind of data, which generates bizarre error structures and makes measures of model fit arbitrary and ungrounded in statistical theory. This isn’t something that is controversial in any way. It is not “hotly contested” or open to debate despite what you may read in the ecological literature (e.g., Reynolds & Rhodes 2009), and you can’t (well, at least you shouldn’t) choose to use binning-based methods just because someone else did (e.g., Maestre & Escudero 2009)*.

So, to be rigorous you want to use a more appropriate framework, which again should be likelihood (or Bayes; or something more complicated that I know nothing about; but if you’re taking the time to read this article you should probably start with likelihood). To determine the likelihood of a model given the data you simply take the product of the probability density function (pdf) evaluated at each value of x (put each value of x into the equation for the pdf with the parameters set to the maximum likelihood estimates and then multiply all of the resulting values together). Having done this you can use a likelihood ratio test to compare two distributions (if you’re into p-values this is for you) or you can use model selection based on an information criterion like AIC. With the likelihood in hand the AIC is then just 2k - 2*ln(likelihood) (where k is the number of parameters). In practice you’ll probably want to calculate the ln(likelihood) from the start (otherwise the values get really small and you’ll run into precision issues), so you would typically take the sum of the log of the pdf instead of the product described above. Andy Edwards’ 2007 Nature paper does a nice job of talking about this in the context of Lévy flights. It’s worth keeping in mind that the details can have an important influence here, so you’ll want to be sure that your pdfs have appropriately defined minimum and maximum values and satisfy the other constraints on the parameters as well. This approach will yield valid AICs for comparing models. In contrast, the AIC values in another recent Nature paper (which are based on binning the data, fitting regressions, and then estimating the likelihoods of those regressions) are not grounded in probability in the same way and in my opinion are not appropriate (at least without some Monte Carlo work to show that they perform well).
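The calculation described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine and not from any of the papers mentioned: the two candidate models are a power law and an exponential, both bounded below by a known minimum `xmin` and unbounded above, the data are simulated, and the function names are made up for the example. With `xmin` treated as known, each model has one free parameter, so k = 1 for both.

```python
import math
import random


def power_law_loglik(xs, xmin):
    """MLE exponent and log-likelihood for the pdf
    f(x) = (b - 1) * xmin**(b - 1) * x**(-b), for x >= xmin and b > 1."""
    n = len(xs)
    # Closed-form maximum likelihood estimate of the exponent b.
    b = 1.0 + n / sum(math.log(x / xmin) for x in xs)
    # Sum of ln(pdf) rather than product of pdf, to avoid underflow.
    loglik = sum(
        math.log(b - 1.0) + (b - 1.0) * math.log(xmin) - b * math.log(x)
        for x in xs
    )
    return b, loglik


def exponential_loglik(xs, xmin):
    """MLE rate and log-likelihood for the pdf
    f(x) = lam * exp(-lam * (x - xmin)), for x >= xmin."""
    n = len(xs)
    # Closed-form maximum likelihood estimate of the rate lam.
    lam = n / sum(x - xmin for x in xs)
    loglik = sum(math.log(lam) - lam * (x - xmin) for x in xs)
    return lam, loglik


def aic(loglik, k):
    """AIC = 2k - 2*ln(likelihood), where k is the number of parameters."""
    return 2 * k - 2 * loglik


# Simulated data (an assumption for illustration): 1000 draws from a
# power law with exponent 2.5 and minimum 1, via inverse-CDF sampling.
rng = random.Random(42)
xmin = 1.0
true_b = 2.5
xs = [xmin * (1.0 - rng.random()) ** (-1.0 / (true_b - 1.0)) for _ in range(1000)]

b_hat, ll_power = power_law_loglik(xs, xmin)
lam_hat, ll_exp = exponential_loglik(xs, xmin)

aic_power = aic(ll_power, k=1)
aic_exp = aic(ll_exp, k=1)

print("power law:   b =", round(b_hat, 3), " AIC =", round(aic_power, 1))
print("exponential: lam =", round(lam_hat, 3), " AIC =", round(aic_exp, 1))
print("lower AIC:", "power law" if aic_power < aic_exp else "exponential")
```

Note that both pdfs are written so that they integrate to one over the same range (from `xmin` to infinity). If your data also have a meaningful maximum, the pdfs and the estimators would need to change accordingly, which is exactly the kind of detail that matters here.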

I’m not trying to give anyone a hard time about what they’ve done in the past. There really is a failure of education and discussion regarding how to deal with distributions in ecology. That said, now that the discussion of these issues has started to reach the broad ecological population we do need to be careful about unnecessarily and inappropriately fomenting a statistical controversy that doesn’t exist, so that we can move towards the use and refinement of the most rigorous methods available.

Further reading

If you’re looking for a good introduction to this area I highly recommend The Ecological Detective by Hilborn & Mangel. If you’re looking for something with more advanced material and technical detail I like In All Likelihood by Pawitan. I’ve also heard good things about Benjamin Bolker’s new book, but I have not yet read it myself.

*NB: I haven’t conducted any Monte Carlo work on this myself like I have for parameter estimation, but I have read quite a bit of statistical literature in this area and if you do the same I think you will find that statisticians don’t even consider the possibility of binning and fitting regressions, because it is so obviously disconnected from the question at hand.
