Frequency distributions for ecologists IIb: CDFs and Kernel density estimates

Beyond simple histograms, there are two basic methods for visualizing frequency distributions: kernel density estimation and cumulative distribution functions.

Kernel density estimation (KDE) is basically a generalization of the idea behind histograms. The basic idea is to put a miniature distribution (e.g., a normal distribution) at the position of each individual data point and then add up those distributions to get an estimate of the frequency distribution. This is a well developed field with a number of advanced methods for estimating the true form of the underlying frequency distribution. Silverman’s 1986 book (Density Estimation for Statistics and Data Analysis) is an excellent starting point for those looking for details. I like KDE as a general approach. It has the same issue as binning in that, depending on the kernel width (equivalent to bin width), you can get different impressions of the data, but it avoids having to choose the position of bin edges. It also estimates a continuous distribution, which is what we are typically looking for.
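To make the "miniature distribution at each data point" idea concrete, here is a minimal sketch in plain Python. The function name `kde_normal`, the abundance values, and the two kernel widths are all made up for illustration; this is not a substitute for the more careful methods in the KDE literature, just the bare idea with a normal kernel.

```python
import math

def kde_normal(data, bandwidth):
    """Return a function that estimates density by summing normal kernels.

    `bandwidth` is the kernel width (the standard deviation of each
    miniature normal distribution). Illustrative helper, not a library API.
    """
    n = len(data)
    # Rescale so the summed kernels integrate to 1
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # One normal kernel centred on each data point, added together
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical species abundances, log-transformed as is common for SADs
abundances = [1, 1, 2, 3, 3, 4, 7, 12, 20, 45]
log_abund = [math.log10(a) for a in abundances]

# Two kernel widths give two different impressions of the same data
narrow = kde_normal(log_abund, bandwidth=0.1)
wide = kde_normal(log_abund, bandwidth=0.5)

# Check numerically that each estimate integrates to roughly 1
step = 0.01
grid = [-3.0 + i * step for i in range(801)]  # covers -3 to 5
area_narrow = sum(narrow(x) for x in grid) * step
area_wide = sum(wide(x) for x in grid) * step
```

Evaluating `narrow` and `wide` on the same grid and plotting them side by side shows the kernel-width issue directly: the narrow kernel produces a spiky curve with a bump near each cluster of points, while the wide one smooths them into a single hump.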

Cumulative distribution functions (CDFs for short) characterize the probability that a randomly drawn value of the x-variable will be less than or equal to a certain value of x. The CDF is conveniently the integral of the probability density function, so if you understand the CDF then simply taking its derivative will give you the frequency distribution. The CDF is nice because there are no choices that need to be made to construct a CDF for empirical data. No bin widths, no kernel widths, nothing. All you have to do is rank the n observed values of x from smallest to largest (i = 1, ..., n). The probability that an observation is less than or equal to x (the CDF) is then estimated as i/n. If all of this seems a bit esoteric, just think about the rank-abundance distribution. This is basically just a CDF of the abundance distribution with the axes flipped (and one of them rescaled). Because of the objectivity of this approach, it has recently been suggested that this is the best approach to visualizing species-abundance distributions. I’ve spent a fair bit of time working with CDFs for exactly this reason. The problem that I have run into is that, except in certain straightforward situations, it can be difficult and counterintuitive to grasp what a given CDF translates into with respect to the frequency distribution of interest.

To construct the CDF, first rank the n observed values (x_i) from smallest to largest (i = 1, ..., n). The probability that an observation is less than or equal to x_i (the CDF) is then estimated as i/n.
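The ranking recipe for the empirical CDF can be sketched in a few lines of Python. The function names and the abundance values below are hypothetical, chosen only for illustration. One detail worth flagging: when several observations share the same value, i/n is only the proportion "less than or equal to x" at the last of the tied ranks, so the sketch also includes a small lookup helper that handles ties.

```python
from bisect import bisect_right

def empirical_cdf(values):
    """Rank the observations and return (sorted values, i/n for i = 1..n)."""
    xs = sorted(values)
    n = len(xs)
    return xs, [i / n for i in range(1, n + 1)]

def cdf_at(xs, x):
    """Proportion of the sorted observations `xs` that are <= x."""
    # bisect_right counts how many sorted values are <= x, ties included
    return bisect_right(xs, x) / len(xs)

def rank_abundance(values):
    """Pair each abundance with its rank, most abundant species first."""
    return list(enumerate(sorted(values, reverse=True), start=1))

# Hypothetical species abundances, for illustration only
abundances = [1, 1, 2, 3, 3, 4, 7, 12, 20, 45]

xs, probs = empirical_cdf(abundances)
half = cdf_at(xs, 3)            # 5 of the 10 species have abundance <= 3
ranked = rank_abundance(abundances)
```

The rank-abundance pairs in `ranked` carry the same information as `xs` and `probs`: the sort order is reversed, rank is plotted on the horizontal axis instead of abundance, and ranks are left as counts rather than divided by n. That is the "axes flipped, one of them rescaled" relationship.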

The take-home message for visualization is that any of the available approaches, done well and properly understood, is perfectly satisfactory for visualizing frequency distribution data.
