Frequency distributions for ecologists V: Don’t let the lack of a perfect tool prevent you from asking interesting questions
I had an interesting conversation with someone the other day that made me realize I needed one last frequency distribution post, so that the earlier posts don't discourage anyone from moving forward with addressing interesting questions.
As a quantitative ecologist I spend a fair amount of time trying to figure out the best way to do things. In other words, I often want to know what the best available method is for answering a particular question. When I think I've figured this out I (sometimes, if I have the energy) try to communicate the best methodology more broadly to encourage good practice and accurate answers to questions of interest to ecologists. In some cases finding the best approach is fairly easy. For example, likelihood based methods for fitting and comparing simple frequency distributions are often straightforward and can easily be looked up online. However, in many cases the methodological challenges are more substantial, or the question being asked is not general enough for the methods to have been worked out and clearly presented. This happens in the case of frequency distributions when one needs non-standard minimum and maximum values (a common case in ecological studies) or when one needs discrete analogs of traditionally continuous distributions. It's not that these cases can't be addressed, it's just that you can't look the solutions up on Wikipedia.
So, what is someone without a sufficient background to do (and, btw, that might be all of us if the problem is really hard or even… intractable)? First, I'd recommend asking for help. Talk to a statistician at your university or a quantitative colleague and see if they can help you figure things out. I am always pleased to try to help out because I always learn something in the process. Then, if that fails, just do something. Morgan and I will probably write more about this later, but please, please, please don't let the questions you ask as an ecologist be defined by the availability of an ideal statistical methodology that is easy to implement. In the context of the current series of posts, if you are trying to do something with a more complex frequency distribution and you can't find a solution to your problem using likelihood, then use something else. If it were me I'd go with either normalized logarithmic binning or something based on the CDF, as these methods can behave reasonably well. Sure, people like me may complain, but that's fine. Just make clear that you are aware of the potential weaknesses and that you did what you did because you couldn't figure out an appropriate alternative approach. That way you still get to make progress on the question of interest and you may motivate people to help develop better methods. Sure, you might not be presenting the "right" answer, but then I very much doubt that we ever are when studying ecological systems anyway.
Many of us have had the feeling that something is not right these days with the peer-review system in science. Whenever I chat with colleagues about the peer review system, two issues consistently crop up: an increasing number of review requests that we cannot possibly keep up with and/or reviews that seem to indicate a reviewer did not spend much time with the manuscript they were reviewing. So, when Ecology Letters published an article in 2008 (Hochberg et al), written by a group of its editors, titled "The tragedy of the reviewer commons", I read it with great interest. However, I was dismayed to see that apparently the entire fault for the current sad state of affairs lay with people like me: reviewers and authors. I was slightly peeved at the tone of the article, which implied that things would improve if only reviewers/authors behaved better. Where was the responsibility of the journals/editors in this mess? I thought, "I really need to write a blog post on this". I never got around to it. Since then, at conferences and in additional publications (e.g., McPeek et al 2008), I have heard the same refrains: Scientists need to review faster, better, smarter. I began to wonder if I was alone in this world in my feeling that reviewers/authors are only half of the equation. Then I read a blog article over at the Chronicle of Higher Education. This article was also about the problems with the peer-review system, but from the perspective of a reviewer/author. And I realized not only was I not alone, but that we needed more voices demanding real dialogue on this issue. So here we go: a reviewer/author's take on how journals/editors can help reviewers/authors make journals/editors happier.
1) Better reviewer databases: I say no a lot to reviews because I say yes a lot to reviews, not because I lack a sense of scientific responsibility. The Chronicle blog (by a sociologist) points out that the number of members in the American Sociological Association is more than enough to support a reasonable number of reviews/person. However, a much smaller number of people seem to be shouldering the load. I suspect the same is true for ecology. So why is this? Undoubtedly the journals are right that there are curmudgeons who simply refuse to review. But I also suspect that editors are busy people like the rest of us and when we are busy we go with the names of people who come to mind quickly; these “go-to” people are “the most obvious people” to review a paper or give a talk. However, those go-to people are often the same for many people – resulting in the smaller number of people getting a higher load of review requests. As a reviewer I try to help with this situation by recommending people I think are not yet “in the system” (post-docs, young assistant professors, etc), but I might humbly suggest that journals invest in better reviewer databases to help editors come up with a better diversity of names.
2) More editorial control: My next two suggestions are not going to make me popular with either authors or editors. And I know (if they got implemented) I would occasionally get hoist by my own petard, but I strongly believe that with the demands journals are making on reviewers these days (thorough reviews, lots of reviews, quick reviews) journals have a responsibility to protect reviewers from superfluous reviews (i.e. unnecessary review requests).
a) Better pre-review vetting. Many authors will hate this because it means one person is probably deciding whether or not to send something out for review. A bad draw on an editor (who has a strong personal opinion on the validity/novelty of your work) can kill your submission. However, I am not alone in having received manuscripts for review that are so poorly written that they are in effect incomprehensible, or so far from the journal's standard that clearly no editor looked at the manuscript before sending it on to me. I'm not talking about borderline cases but manuscripts so bad I barely know how to review them. As a reviewer this just makes me mad and takes up valuable time that could have been dedicated to a manuscript that actually deserved consideration. As the Chronicle post points out: manuscripts do not have a fundamental right to be reviewed.
b) Stop looking for reviewer consensus. I have noticed a trend at certain journals: manuscripts keep being sent back to the reviewer until the reviewer "signs off" on the manuscript. This is consistent with the idea in the Ecology Letters article that authors are needlessly lengthening the review process by ignoring reviewer comments. As much as we may all wish otherwise, not all reviewer comments reflect absolute truth. We all have our opinions on things that (if we're being honest with ourselves) actually are in gray areas. Sometimes reviewers just flub things. And, journals are right, sometimes reviewers give shoddy reviews. As both a reviewer and an author I recognize this. As a reviewer, I assume the editor will read my review (and the paper) and decide for themselves whether they agree with my opinion. As an author, I assume that the editor will read my response to a reviewer and decide whether my objections to a certain critique have merit. As a reviewer, the only time I want to re-review a paper is if I have labeled my concern as "fatal" and the editor is uncertain whether the authors have either dealt with that concern or have a valid argument for why it is not a concern. In a world where reviewers are scarce, manuscripts should only go back to reviewers when absolutely necessary. This requires editors to insert themselves into the process more than perhaps they have been accustomed to.
Maybe journals and editors already feel like they do these things. I don’t know. I do know I feel like I already do the things they want me as a reviewer to do! However, given how widespread concern over the strain on the peer-review process is, it seems to me that perhaps it’s time for a real dialogue – and that involves both sides talking about their perspectives and making suggestions about how to improve things. Anyone out there have additional ideas for things that could be done?
A couple of weeks ago we made it possible for folks to subscribe to JE using email. We did this because we realized that many scientists, even those who are otherwise computationally savvy, really haven’t embraced feed readers as a method of tracking information. When I wrote that post I promised to return with an argument for why you should start using a feed reader instead – so here it is. If anyone is interested in a more instructional post about how to do this then let us know in the comments.
The main argument
I’m going to base my argument on something that pretty much all practicing scientists do – keeping track of the current scientific literature by reading Tables of Contents (TOCs). Back in the dark ages the only way to get these TOCs was to either have a personal subscription to the journal or to leave the office and walk the two blocks to the library (I wonder if anyone has done a study on scientists getting fatter now that they don’t have to go to the library anymore). About a decade ago (I’m not really sure when, but this seems like it’s in the right ballpark) journals started offering email subscriptions to their TOCs. Every time a new issue was published you’d receive an email that included the titles and authors of each contribution and links to the papers (once the journal actually had the papers online of course). This made it much easier to keep track of the papers being published in a wide variety of journals by speeding up the process of determining if there was anything of interest in a given issue. While the increase in convenience of using a feed reader may not be on quite the same scale as that generated by the email TOCs, it is still fairly substantial.
The nice thing about feed readers is that they operate one item at a time. So, instead of receiving one email with 10-100 articles in it, you receive 10-100 items in your feed reader. This leads to the largest single advantage of feeds over email for tracking TOCs. You only need to process one article at a time. Just think about the last time you had 5 minutes before lunch and you decided to try to clear an email or two out of your inbox. You probably opened up a TOC email and started going through it top to bottom. If you were really lucky then maybe there were only a dozen papers and none of them were of interest and you could finish going through the email and delete it. Most of the time however there are either too many articles or you want to look at at least one so you go to the website, read the abstract, maybe download the paper, and the next thing you know it’s time for lunch and you haven’t finished going through the table so it continues to sit in your inbox. Then, of course, by the time you get back to it you probably don’t even remember where you left off and you basically have to start back at the beginning again. I don’t know about you but this process typically resulted in my having dozens of emailed TOCs lying around my inbox at any one time.
With a feed reader it's totally different. You go through the posts for individual articles one at a time, so in five minutes you can often clear out 5 or 10 articles (or even 50 if the feed is well tagged, like PNAS's feed), which means that you can use your small chunks of free time much more effectively for keeping up with the literature. In addition, all major feed readers allow you to 'star' posts – in other words you can mark them in such a way that you can go back to them later and look at them in more detail. So, instead of the old system where if you were interested in looking at a paper you had to stop going through the table of contents, go to the website, decide from the abstract if you wanted to actually look at the paper, and then either download or print a copy of the paper to look at later, with a feed reader you achieve the same thing with a one-second click. This means that you can often go through a fairly large TOC in less than 10 minutes.
Of course much of this utility depends on the journals actually providing feeds that include all of the relevant information.
Keeping your TOCs and other feeds outside of your email allows for greater separation of different aspects of online communication. If you monitor your email fairly continuously, the last thing you need is to receive multiple TOC emails each day that could distract you from actually getting work done. Having a separate feed reader lets you decide when you want to look at this information (like in those 5 minute gaps before lunch or at the end of the day when you're too brain dead to do anything else).
Now that journals post many of their articles online as soon as the proofs stage is complete, it can be advantageous to know about these articles as soon as they are available. Most journal feeds do exactly this, posting a few papers at a time as they are uploaded to the online-early site.
Sharing – want to tell your friends about a cool paper you just read? You could copy the link, open a new email, paste the link and then send it on to them. Or, you could accomplish this with a single click (NB: this technology is still developing and varies among feed readers).
And then of course there are blogs
I’ve attempted to appeal to our non-feedreader-readers by focusing on a topic that they can clearly identify with. That said, the world of academic communication is rapidly expanding beyond the walls of the journal article. Blogs play an increasingly important role in scientific discourse and if you’re going to follow blogs you really need a feed reader. Why? Because while some blogs update daily (e.g., most of the blogs over at ScienceBlogs) many good blogs update at an average rate of once a week, or once a month. You don’t want to have to check the webpage of one of these blogs every day just to see if something new has been posted, so subscribe to its feed, kick back, and let the computer tell you what’s going on in the world.
This is a table of contents of sorts for five posts on the visualization, fitting, and comparison of frequency distributions. The goal of these posts is to expose ecologists to the ideas and language related to good statistical practices for addressing frequency distribution data. The focus is on simple distributions and likelihood methods. The information provided here is far from comprehensive, but my aim is to give readers a good place to start when exploring this kind of data.
Likelihood, likelihood, likelihood (and maybe some other complicated approaches), but definitely not r^2 values from fitting regressions to binned data.
A bit more nitty gritty detail
In addition to causing issues with parameter estimation, binning-based methods are also inappropriate when trying to determine which distribution provides the best fit to empirical data. As a result you won't find any card-carrying statistician recommending this approach. Basically, binning and fitting regressions ignores the very nature of this kind of data, generating bizarre error structures and making measures of model fit arbitrary and ungrounded in statistical theory. This isn't something that is controversial in any way. It is not "hotly contested" or open to debate despite what you may read in the ecological literature (e.g., Reynolds & Rhodes 2009), and you can't (well, at least you shouldn't) choose to use binning-based methods just because someone else did (e.g., Maestre & Escudero 2009)*.
So, to be rigorous you want to use a more appropriate framework, which again should be likelihood (or Bayes; or something more complicated that I know nothing about; but if you're taking the time to read this article you should probably start with likelihood). To determine the likelihood of a model given the data you simply take the product of the probability density function (pdf) evaluated at each value of x (put each value of x into the equation for the pdf with the parameters set to the maximum likelihood estimates and then multiply all of the resulting values together). Having done this you can use a likelihood ratio test to compare two distributions (if you're into p-values this is for you) or you can use model selection based on an information criterion like AIC. With the likelihood in hand the AIC is then just 2k - 2*ln(likelihood) (where k is the number of parameters). In practice you'll probably want to work with the ln(likelihood) from the start (otherwise the values get really small and you'll run into precision issues), so you would typically take the sum of the log of the pdf instead of the product described above. Andy Edwards' 2007 Nature paper does a nice job of talking about this in the context of Lévy flights. It's worth keeping in mind that the details can have an important influence here, so you'll want to be sure that your pdfs have appropriately defined minimum and maximum values and satisfy the other constraints on the parameters as well. This approach will yield valid AICs for comparing models. In contrast, the AIC values in another recent Nature paper (which are based on binning the data, fitting regressions, and then estimating the likelihoods of those regressions) are not grounded in probability in the same way and in my opinion are not appropriate (at least without some Monte Carlo work to show that they perform well).
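As a minimal sketch of this workflow in Python with scipy (not from the original post; the simulated data and the two candidate distributions here are purely hypothetical), the log-likelihood is the sum of the log pdf over the observations, and the AIC follows directly:

```python
# Sketch: comparing two candidate distributions by AIC.
# The data are simulated stand-ins for an empirical frequency distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=500)  # hypothetical empirical data

# Fit each candidate by maximum likelihood (loc fixed at 0)
expon_params = stats.expon.fit(x, floc=0)      # 1 free parameter (scale)
lognorm_params = stats.lognorm.fit(x, floc=0)  # 2 free parameters (shape, scale)

# ln(likelihood) = sum of log pdf evaluated at each observation
ll_expon = np.sum(stats.expon.logpdf(x, *expon_params))
ll_lognorm = np.sum(stats.lognorm.logpdf(x, *lognorm_params))

# AIC = 2k - 2 ln(likelihood), where k is the number of free parameters
aic_expon = 2 * 1 - 2 * ll_expon
aic_lognorm = 2 * 2 - 2 * ll_lognorm

print(aic_expon, aic_lognorm)  # the lower AIC identifies the preferred model
```

Note that the sum of log pdfs is used instead of the product of pdfs, exactly because the product underflows quickly for realistic sample sizes.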
I’m not trying to give anyone a hard time about what they’ve done in the past. There really is a failure of education and discussion regarding how to deal with distributions in ecology. That said, now that the discussion of these issues has started to reach the broad ecological population we do need to be careful about unnecessarily and inappropriately fomenting a statistical controversy that doesn’t exist, so that we can move towards the use and refinement of the most rigorous methods available.
If you’re looking for a good introduction to this area I highly recommend The Ecological Detective by Hilborn & Mangel. If you’re looking for something with more advanced material and technical detail I like In All Likelihood by Pawitan. I’ve also heard good things about Benjamin Bolker’s new book, but I have not yet read it myself.
*NB: I haven’t conducted any Monte Carlo work on this myself like I have for parameter estimation, but I have read quite a bit of statistical literature in this area and if you do the same I think you will find that statisticians don’t even consider the possibility of binning and fitting regressions, because it is so obviously disconnected from the question at hand.
Don't bin your data and fit a regression. Don't use the CDF and fit a regression. Use maximum likelihood or other statistically grounded approaches, which can typically be looked up on Wikipedia.
A bit more detail
OK, so you've visualized your data and after playing around a bit you have an idea of what the basic functional form of the model is. Now you want to estimate the parameters. So, for example, you've log-transformed the axes and your distribution is approximately a straight line, so you think it's a power-law and you want to estimate the exponent of the power-law (more about figuring out if this is actually the best model for your data in the next and final installment).
Ecologists typically fall back on what they know from an introductory statistics class or two to solve this problem – regression. They bin the data, count how many points occur in each binned set of x values, and use those counts for the y data and the bin centers for the x data. They then fit a regression to these points to estimate parameters. You can't blame us as a group for this approach because we typically haven't received training in the proper methods and we're just doing our best. However, this approach is not valid and can yield very poor parameter estimates in some cases.
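To make the recipe concrete, here is a sketch in Python of exactly this binning-and-regression approach applied to simulated power-law data (the sample size, bin count, and seed are arbitrary choices for illustration, not recommendations):

```python
# Sketch: the common-but-flawed recipe — bin the data, count points per
# bin, and fit a straight line on log-log axes to "estimate" the exponent.
import numpy as np

rng = np.random.default_rng(0)
alpha_true, xmin = 2.5, 1.0

# Inverse-transform sampling from p(x) ~ x^(-alpha), x >= xmin
x = xmin * (1 - rng.random(5000)) ** (-1.0 / (alpha_true - 1.0))

# Linear bins, counts per bin, bin centers
counts, edges = np.histogram(x, bins=50)
centers = (edges[:-1] + edges[1:]) / 2

# Drop empty bins (log(0) is undefined), then ordinary least squares
keep = counts > 0
slope, intercept = np.polyfit(np.log(centers[keep]), np.log(counts[keep]), 1)

print(-slope)  # the regression "estimate" of alpha; often well off 2.5
```

The sparse, noisy tail bins violate the constant-variance assumption of ordinary least squares, which is one concrete way the "bizarre error structures" mentioned above show up.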
The best approach actually varies a bit depending on the specific form of the function, but generally what you want to do is maximum likelihood estimation (MLE). This approach basically determines the values of the free parameters that are most likely (i.e., have the greatest probability) of producing the observed data. Technically what you really want is actually the minimum variance unbiased estimator (MVUE), which will often be a slightly modified form of the maximum likelihood estimate. The bad news is that most ecologists won't want to derive the MLE themselves. The good news is that equations for calculating the MLE (or the MVUE) are readily available, and once you have the equation this approach is actually far less work than binning and fitting lines to data. For most common distributions you can just google the name of the distribution. The first link will typically be the Wikipedia entry. Just go to the Wikipedia page, scroll down to Parameter estimation, and use the equation provided (and yes, Wikipedia is quite reliable for this kind of information). Even if the best approach isn't maximum likelihood, this will typically provide you with the best approach (or at least a very good one). If you're doing something more complicated there is an excellent series of reference books by Norman Johnson and colleagues that covers parameter estimation for most characterized distributions in detail. Some of these solutions will require the use of numerical methods. This is fairly straightforward in R and Matlab, but if that's a bit much you should be able to find a statistician or quantitatively inclined ecological colleague to help (they'll be impressed that you got as far as you did).
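For the power-law example, the MLE is one of the lookup-able closed forms described above: assuming a continuous power law with a known lower bound xmin, the estimator is alpha_hat = 1 + n / sum(ln(x_i / xmin)). A short Python sketch with simulated data (the sample size and true exponent are hypothetical):

```python
# Sketch: closed-form MLE for the exponent of a continuous power law
# p(x) ~ x^(-alpha) for x >= xmin, applied to simulated data.
import numpy as np

def powerlaw_mle(x, xmin):
    """alpha_hat = 1 + n / sum(ln(x_i / xmin)) for x_i >= xmin."""
    x = np.asarray(x, dtype=float)
    x = x[x >= xmin]
    n = len(x)
    return 1.0 + n / np.sum(np.log(x / xmin))

rng = np.random.default_rng(0)
alpha_true, xmin = 2.5, 1.0

# Inverse-transform sampling: x = xmin * (1 - u)^(-1 / (alpha - 1))
u = rng.random(10_000)
x = xmin * (1 - u) ** (-1.0 / (alpha_true - 1.0))

alpha_hat = powerlaw_mle(x, xmin)
print(alpha_hat)  # recovers a value close to the true exponent of 2.5
```

Note how much less work this is than binning and fitting a line, once you have the equation in hand.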
Recommended reading (focused on power-laws because that’s what I’m most familiar with)
After writing about the importance of good RSS feeds for a particular subset of the academic community it occurred to me that part of the reason that we have such hit and miss implementations of feeds by journals is that most academics don't even know what a feed is, let alone actually use a feed reader. If this is you then we still want you to be able to get regular updates from JE, so last night I set up a new feed using Feedburner. What this means to you is that you can now subscribe to JE using email. If you follow the link in the upper right hand corner you will get a single email each morning that we post new content. If you're curious about using our RSS feed instead, I'm going to try to put up a post in the next few days to explain the benefits of this approach over email for keeping track of Tables of Contents, so you may want to wait to see if you're convinced to start using a feed reader.
Those of you who are already subscribed to our WordPress feed have nothing to worry about. It’s not going anywhere.
I’d recommend checking out this post by River Continua about an impressively sophisticated phishing scam targeted at academics. They’re going to catch a bunch of folks with this one.
UPDATE: Apparently this is something that the EPA does that the EPA employee who wrote the original post was unaware of. They definitely need to rethink the composition of the email though as I would have been (and obviously was) equally suspicious.
- If you don't have an easily accessible RSS feed available (and by easily accessible I mean in the browser's address bar on your journal's main page) for your journal's Table of Contents (TOCs), there is a certain class of readers who will not keep track of your TOCs. This is because receiving this information via email is outdated and inefficient, and if you are in the business of content delivery it is, at this point, incompetent for you to not have this option (it's kind of like not having a website 10 years ago).
- If, for some technophobic reason, you refuse to have an RSS feed, then please, pretty please with sugar on top, don't hide the ability to subscribe to the TOCs behind a username/password wall. All you need is a box for people to add their email addresses to for subscribing and a prominent unsubscribe link in the emails (if you are really paranoid you can add a confirmation email with a link that needs to be followed to confirm the subscription).
- Most importantly. Please, for the love of all that is good and right in the world, DO NOT START AN RSS FEED AND THEN STOP UPDATING IT. Those individuals who track a large number of feeds in their feed readers will not notice that you stopped updating your feed for quite some time. You are losing readers when you do this.
- If you have an RSS feed that is easily accessible (congratulations, you're ahead of many Elsevier journals) please try to maximize the amount of information it provides. There are three critical pieces of information that should be included in every TOC feed:
- The title (you all manage to do this one OK)
- All of the authors’ names. Not just the first author. Not just the first and last author. All of the authors. Seriously, part of the decision making process when it comes to choosing whether or not to take a closer look at a paper is who the authors are. So, if you want to maximize the readership of papers, include all of the authors’ names in the RSS feed.
- The abstract. I cannot fathom why you would exclude the abstract from your feed, other than to generate click throughs to your website. Since those of you doing this (yes, Ecology, I’m talking about you) aren’t running advertising, this isn’t a good reason, since you can communicate the information just as well in the feed (and if you’re using website visits as some kind of metric, don’t worry, you can easily track how many people are subscribed to your feed as well).
If this seems a bit harsh, whiny, etc., then keep this in mind. In the last month I had over 1000 new publications come through my feed reader and another 100 or so in email tables of contents. This is an incredible amount of material just to process, let alone read. If journals want readers to pay attention to their papers it is incumbent upon them to make it as easy as possible to sort through this deluge of information and allow their readership to quickly and easily identify papers of interest. Journals that don’t do this are hurting themselves as well as their readers.
During the course of this long volume I have undoubtedly plagiarized from many sources–to use the ugly term that did not bother Shakespeare’s age. I doubt whether any criticism or cultural history has ever been written without such plagiary, which inevitably results from assimilating the contributions of your countless fellow-workers, past and present. The true function of scholarship as a society is not to stake out claims on which others must not trespass, but to provide a community of knowledge in which others may share.
-F. O. Matthiessen, American Renaissance 1941.
Thanks to academhack for pointing me to this great quote. Given the spirit of the quote I don't think he'll mind me reposting it. I find it particularly relevant in light of recent discussion about tracking down self-plagiarism and how grave an offense it may be. I'm not saying that self-plagiarism, and regular plagiarism, aren't serious issues. I've been involved in reporting a case of self-plagiarism myself and it's disturbing when you see it. It's also clearly bad for science. It clutters the already crowded literature, has a negative influence on broader perceptions regarding the ethics of scientists, and results in undue credit (which presumably influences funding and promotion). That said, I think it's important to keep the bigger picture in mind. We are, after all, in the business of ideas. Words are important for communicating those ideas, but the ideas themselves are the currency of interest. Our ideas are influenced by everyone we talk to, colored by every paper we've ever read and talk we've ever seen. The goal of science is (or at least should be) to progress our knowledge as rapidly as possible. I think that conversations surrounding plagiarism, and what it means in this new era, should start from this core goal and proceed from there.
Beyond simple histograms there are two basic methods for visualizing frequency distributions.
Kernel density estimation is basically a generalization of the idea behind histograms. The basic idea is to put a miniature distribution (e.g., a normal distribution) at the position of each individual data point and then add up those distributions to get an estimate of the frequency distribution. This is a well-developed field with a number of advanced methods for estimating the true form of the underlying frequency distribution. Silverman's 1981 book is an excellent starting point for those looking for details. I like KDE as a general approach. It has the same issues as binning in that depending on the kernel width (equivalent to bin width) you can get different impressions of the data, but it avoids the issue of having to choose the position of bin edges. It also estimates a continuous distribution, which is what we are typically looking for.
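A quick sketch of KDE in practice, using scipy's Gaussian KDE (the simulated data and the bandwidth values here are just illustrative; the bandwidth plays exactly the role that bin width plays for a histogram):

```python
# Sketch: kernel density estimation — a normal kernel at each data point,
# summed into a smooth estimate of the frequency distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=1000)  # hypothetical data

kde = stats.gaussian_kde(data)                      # automatic bandwidth
kde_wide = stats.gaussian_kde(data, bw_method=1.0)  # deliberately oversmoothed

grid = np.linspace(-4, 4, 201)
density = kde(grid)            # estimated pdf evaluated on a grid
density_wide = kde_wide(grid)  # flatter: wide kernels smear out structure

# The estimate behaves like a proper density (integrates to ~1)
approx_integral = (density * (grid[1] - grid[0])).sum()
print(approx_integral)
```

Evaluating the same data with two bandwidths makes the binning analogy concrete: the oversmoothed estimate has a visibly lower, flatter peak, just as overly wide bins would.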
Cumulative distribution functions (CDF for short) characterize the probability that a randomly drawn value of the x-variable will be less than or equal to a certain value of x. The CDF is conveniently the integral of the probability density function, so if you understand the CDF then simply taking its derivative will give you the frequency distribution. The CDF is nice because there are no choices that need to be made to construct a CDF for empirical data. No bin widths, no kernel widths, nothing. All you have to do is rank the n observed values of x from smallest to largest (i=1 . . . n). The probability that an observation is less than or equal to x (the CDF) is then estimated as i/n. If all of this seems a bit esoteric, just think about the rank-abundance distribution. This is basically just a CDF of the abundance distribution with the axes flipped (and one of them rescaled). Because of the objectivity of this approach it has recently been suggested that this is the best approach to visualizing species-abundance distributions. I've spent a fair bit of time working with CDFs for exactly this reason. The problem that I have run into with this approach is that except in certain straightforward situations it can be difficult/counterintuitive to grasp what a given CDF translates into with respect to the frequency distribution of interest.
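The ranking recipe above takes only a few lines of Python (the data here are hypothetical; the function name is my own, not from any package):

```python
# Sketch: the empirical CDF — rank the n observed values from smallest
# to largest and estimate P(X <= x_i) as i/n. No bins, no bandwidths.
import numpy as np

def empirical_cdf(x):
    """Return the sorted values and their empirical CDF values i/n."""
    x_sorted = np.sort(x)
    n = len(x_sorted)
    cdf = np.arange(1, n + 1) / n  # i/n for i = 1 .. n
    return x_sorted, cdf

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=200)  # hypothetical data

xs, cdf = empirical_cdf(x)
print(cdf[0], cdf[-1])  # runs from 1/n up to exactly 1
```

Plotting cdf against xs (or flipping the axes) gives the rank-abundance-style picture described above, with no free choices anywhere in the construction.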
The take home message for visualization is that any of the available approaches done well and properly understood is perfectly satisfactory for visualizing frequency distribution data.