# Characterizing the species-abundance distribution with only information on richness and total abundance [Research Summary]

This is the first of a new category of posts here at Jabberwocky Ecology called Research Summaries. We like the idea of communicating our research more broadly than to the small number of folks who have the time, energy, and interest to read through entire papers. So, for every paper that we publish we will (hopefully) also do a blog post communicating the basic idea in a manner targeted towards a more general audience. As a result these posts will intentionally skip over a lot of detail (technical and otherwise), and will intentionally use language that is less precise, in order to communicate more broadly. We suspect that it will take us quite a while to figure out how to do this well. Feedback is certainly welcome.

This is a Research Summary of: White, E.P., K.M. Thibault, and X. Xiao. 2012. Characterizing species-abundance distributions across taxa and ecosystems using a simple maximum entropy model. Ecology. http://dx.doi.org/10.1890/11-2177.1*

The species-abundance distribution describes the number of species with different numbers of individuals. It is well known that within an ecological community most species are relatively rare and only a few species are common, and understanding the detailed form of this distribution of individuals among species has been of interest in ecology for decades. This distribution is considered interesting both because it is a complete characterization of the commonness and rarity of species and because the distribution can be used to test and parameterize ecological models.

Numerous mathematical descriptions of this distribution have been proposed and much of the research into this pattern has focused on trying to figure out which of these descriptions is “the best” for a particular group of species at a small number of sites. We took an alternative approach to this pattern and asked: Can we explain broad scale, cross-taxonomic patterns in the general shape of the abundance distribution using a simple model that requires only knowledge of the species richness and total abundance (summed across all species) at a site?

To do this we used a model that basically describes the most likely form of the distribution if the average number of individuals in a species is fixed (which turns out to be a slightly modified version of the classic log-series distribution; see the paper or John Harte’s new book for details). As a result this model involves no detailed biological processes, and if we know richness and total abundance we can predict the abundance of each species in the community (i.e., the abundance of the most common species, second most common species… rarest species).
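For readers who like to see the idea in code, here is a minimal sketch of that kind of prediction. It is not the code from the paper. It assumes the "slightly modified" log-series is a log-series truncated at the total abundance N, it only handles the case where the fitted parameter `p` falls below 1, and it assigns an abundance to each rank by reading the fitted distribution at the quantile (S - rank + 0.5)/S. The function names and the quantile rule are our own illustrative choices.

```python
import bisect


def truncated_logseries_mean(p, N):
    """Mean abundance per species when P(n) is proportional to p**n / n, n = 1..N."""
    num = den = 0.0
    term = 1.0
    for n in range(1, N + 1):
        term *= p          # term == p**n
        num += term        # sum of n * (p**n / n)
        den += term / n    # normalizing constant
    return num / den


def solve_p(S, N, iterations=100):
    """Bisection for p such that the mean abundance equals N / S.

    Assumption: this sketch only covers H_N < S < N (H_N is the N-th harmonic
    number), which keeps the solution inside (0, 1).
    """
    harmonic = sum(1.0 / n for n in range(1, N + 1))
    if not harmonic < S < N:
        raise ValueError("this sketch needs H_N < S < N")
    target = N / S
    lo, hi = 1e-12, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if truncated_logseries_mean(mid, N) < target:
            lo = mid       # mean increases with p, so move the lower bound up
        else:
            hi = mid
    return 0.5 * (lo + hi)


def predict_rank_abundance(S, N):
    """Predicted abundances, most common species first, from S and N alone."""
    p = solve_p(S, N)
    weights = []
    term = 1.0
    for n in range(1, N + 1):
        term *= p
        weights.append(term / n)
    total = sum(weights)
    cdf = []
    running = 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    predicted = []
    for rank in range(1, S + 1):
        q = (S - rank + 0.5) / S              # assumed quantile for this rank
        idx = bisect.bisect_left(cdf, q)      # first n with CDF(n) >= q
        predicted.append(min(idx, N - 1) + 1)
    return predicted


# A hypothetical community with 10 species and 100 individuals
print(predict_rank_abundance(10, 100))
```

The only inputs are `S` and `N`, which is the whole point: nothing about the biology of the species enters the prediction.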

Since we wanted to know how well this works in general (not how well it works for birds in Utah or trees in Panama) we put together a dataset of more than 15,000 communities. We did this by combining 6 major datasets that are either citizen science, big government efforts, or compilations from the literature. This compilation includes data on birds, trees, mammals, and butterflies. So, while we’re missing the microbes and aquatic species, I think that we can be pretty confident that we have an idea of the general pattern.

In general, we can do an excellent job of predicting the abundance of each rank of species (most abundant, second most abundant…) at each site using only information on the species richness and total abundance at the site. Here is a plot of the observed number of individuals in a given rank at a given site against the number predicted. The plot is for Breeding Bird Survey data, but the rest of the datasets produce similar results.

Observed-predicted plot for nearly 3000 Breeding Bird Survey communities. Since there are over 100,000 points on this plot we’ve color coded them by the number of points in the vicinity of the focal point, so red areas have lots of points nearby and blue areas have very few points. The black line is the 1:1 line.

The model isn’t perfect of course (they never are and we highlight some of its failures in the paper), but it means that if we know the richness and total abundance of a site then we can capture over 90% of the variation in the form of the species-abundance distribution across ecosystems and taxonomic groups.
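One way to put a number on "how much of the variation is captured" in an observed-predicted plot is a coefficient of determination measured around the 1:1 line rather than around a fitted regression line. The sketch below shows that calculation. The log transform and the function name are assumptions on our part for illustration; the paper describes the exact statistic behind the 90% figure.

```python
import math


def r2_one_to_one(observed, predicted):
    """R^2 of log10 abundances around the 1:1 line (assumed metric)."""
    log_obs = [math.log10(x) for x in observed]
    log_pred = [math.log10(x) for x in predicted]
    mean_obs = sum(log_obs) / len(log_obs)
    # Residuals are distances from the 1:1 line, i.e. observed minus predicted
    ss_res = sum((o - p) ** 2 for o, p in zip(log_obs, log_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in log_obs)
    return 1.0 - ss_res / ss_tot


# Made-up numbers purely to show the call; these are not data from the paper
observed = [100, 10, 1]
print(r2_one_to_one(observed, observed))      # perfect prediction
print(r2_one_to_one(observed, [10, 10, 10]))  # no better than the mean
```

Unlike a regression R², this value can go negative if the predictions are worse than simply guessing the mean, which makes it a stricter test of a model with no fitted slope or intercept.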

This result is interesting for two reasons:

First, it suggests that the species-abundance distribution, on its own, doesn’t tell us much about the detailed biological processes structuring a community. Ecologists have known for a while that it wasn’t fully sufficient for distinguishing between different models (though we didn’t always act like it), but our results suggest that in fact there is very little additional information in the distribution beyond knowing the species richness and total abundance. As such, any model that yields reasonable richness and total abundance values will probably produce a reasonable species-abundance distribution.

Second, this means that we can potentially predict the full distribution of commonness and rarity even at locations we have never visited. This is possible because richness and total abundance can, at least sometimes, be well predicted using remotely sensed data. These predictions could then be combined with this model of the species-abundance distribution to make predictions for things like the number of rare species at a site. In general, we’re interested in figuring out how much ecological pattern and process can be effectively characterized and predicted at large spatial scales, and this research helps expand that ability.

So, that’s the end of our first Research Summary. I hope it’s a useful thing that folks get something out of. In addition to the science in this paper, I’m also really excited about the process that we used to accomplish this research and to make it as reproducible as possible. So, stay tuned for some follow up posts on big data in ecology, collaborative code development, and making ecological research more reproducible.

———————————————————————————————————————————————————————————————
*The paper will be Open Access once it is officially published but, for reasons that don’t make a lot of sense to me, it is behind a paywall until it comes out in print.

### 3 Comments on “Characterizing the species-abundance distribution with only information on richness and total abundance [Research Summary]”

1. Hi Dave – Welcome to the blog and thanks for asking a question. Let me answer in two parts, the first more general and the second specific to BBS data.

Yes, there are definitely weaknesses with the BBS data and in fact with all of the data we used (and in reality all data). We touch on some of these issues in the supplement a bit, but mostly I tend to (incorrectly) act like these issues are conventional wisdom at this point. As a macroecologist I’m constantly faced with trying to understand the weaknesses of big datasets, which I didn’t collect and which by virtue of their scope sometimes have unique, or at least unavoidable, issues. My standard approach in deciding how to deal with this is very question specific: given the question that I’m interested in, do the limitations of the data have the potential to create pattern or will they simply add noise? If the answer is the latter then I don’t worry about it too much unless I end up with a surprising negative result. If the answer is the former then there is more cause for concern and I typically want to both caveat the result and try to validate it with some (typically more restricted) data that doesn’t suffer from the same issue. In this particular case there was a concern that some of the sampling issues associated with some of the datasets would produce more log-series like distributions when comparing to the log-normal, so we offered a fairly strong caveat on overinterpreting that comparison. The broader concern about relatively weak sampling intensity changing the results is addressed reasonably well (I think) by the use of a number of different datasets that don’t all share this issue. The Christmas Bird Count data are also for birds, but with more intensive sampling, and the two tree datasets are both complete surveys (over some minimum DBH), so if sampling intensity per se strongly influences the results we’d expect to see BBS differ from some of the other datasets. This is one of the things that I like about working with multiple data sources simultaneously.

So, should one work with BBS abundance data? Like I mentioned above I think that really depends on the question. I think that most folks agree that if you’re doing population level work within similar ecogeographic regions and you don’t care about getting an absolute density (i.e., something proportional to the real density is OK) then you’re good to go. As you move into community level work differences in detection probabilities among species can become an issue and as you compare across large scales so can differences in detection probabilities across sites. I’m not sure whether converting to presence-absence solves the problem in general since rare species will certainly be missed, but again, this will depend on the question.

These are the sorts of issues that rarely get fully and properly discussed among the broader community because they don’t fit naturally in the current publication system, and metadata from the originators of datasets is often only a single perspective that is frequently a bit conservative for my taste when it comes to the usability of data outside of the specific intent for which it was originally collected. This is why I started the Ecological Data Wiki: to provide a place for these kinds of issues to be generally discussed so that the community could come to consensus (or at least clearly present differing viewpoints). It hasn’t reached the level of dialog that I’m hoping for yet, but there are a reasonable number of users and some activity, and I’m hoping to have more time to invest in it this summer.