Category Archives: statistics

Postdoc in Evolutionary Bioinformatics [Jobs]

There is an exciting postdoc opportunity for folks interested in quantitative approaches to studying evolution in Michael Gilchrist’s lab at the University of Tennessee. I knew Mike when we were both in New Mexico. He’s really sharp, a nice guy, and a very patient teacher. He taught me all about likelihood and numerical maximization and opened my mind to a whole new way of modeling biological systems. This will definitely be a great postdoc for the right person, especially since NIMBioS is at UTK as well. Here’s the ad:

Outstanding, motivated candidates are being sought for a post-doctoral position in the Gilchrist lab in the Department of Ecology & Evolutionary Biology at the University of Tennessee, Knoxville. The successful candidate will be supported by a three-year NSF grant whose goal is to develop, integrate and test mathematical models of protein translation and sequence evolution using available genomic sequence and expression level datasets. Publications directly related to this work include Gilchrist, M.A. 2007, Molec. Bio. & Evol. (http://www.tinyurl/shahgilchrist11) and Shah, P. and M.A. Gilchrist 2011, PNAS (http://www.tinyurl/gilchrist07a).

The emphasis of the laboratory is focused on using biologically motivated models to analyze complex, heterogeneous datasets to answer biologically motivated questions. The research associated with this position draws upon a wide range of scientific disciplines including: cellular biology, evolutionary theory, statistical physics, protein folding, differential equations, and probability. Consequently, the ideal candidate would have a Ph.D. in either biology, mathematics, physics, computer science, engineering, or statistics with a background and interest in at least one of the other areas.

The researcher will collaborate closely with the PIs (Drs. Michael Gilchrist and Russell Zaretzki) on this project but will potentially have time to collaborate on other research projects with the PIs. In addition, the researcher will have opportunities to interact with other faculty members in the Division of Biology as well as researchers at the National Institute for Mathematical and Biological Synthesis (http://www.nimbios.org).

Review of applications begins immediately and will continue until the position is filled. To apply, please submit curriculum vitae including three references, a brief statement of research background and interests, and 1-3 relevant manuscripts to mikeg[at]utk[dot]edu.

RStudio [Things you should use]

If you use R (and it seems like everybody does these days) then you should check out RStudio – an easy to install, cross-platform IDE for R. Basically it’s a seamless integration of all of the aspects of R (including scripts, the console, figures, help, etc.) into a single easy to use package. For those of you who are familiar with Matlab, it’s a very similar interface. It’s not a full blown IDE yet (no debugger; no lint) but what this actually means is that it’s simple and easy to use. If you use R I can’t imagine that you won’t love this new (and open source!) tool.

UPDATE: Check out another nice article on RStudio over at i’m a chordata! urochordata!

Thoughts on developing a digital presence

A while ago there was a bit of discussion around the academic blogosphere regarding the importance of developing a digital presence and what the best form of that presence might be. Recently, as I’ve been looking around at academics’ websites as part of the faculty, postdoc and graduate student searches going on in my department/lab, I’ve been reminded of the importance of having a digital presence.

It seems pretty clear to me that the web is the primary source of information acquisition for most academics, at least up through the young associate professors. There are no doubt some senior folk who would still rather have a paper copy of a journal sent to them via snail mail and who rarely open their currently installed copy of Internet Explorer 6, but I would be very surprised if most folks who are evaluating graduate student, postdoctoral and faculty job candidates aren’t dropping the name of the applicant into their favorite search engines and seeing what comes up. They aren’t looking around for dirt like all those scary news stories that were meant to stop college students from posting drunken photos of themselves on social networking sites. They’re just procrastinating looking for more information to get a clearer picture of you as a scientist/academic. I also do a quick web search when I meet someone interesting at a conference, get a paper/grant to review with authors I haven’t heard of before, read an interesting study by someone I don’t know, etc. Many folks who apply to join my lab for graduate school find me through the web.

When folks go looking around for you on the web you want them to find something (not finding anything is the digital equivalent of “being a nobody”), and better yet you want them to find something that puts your best foot forward. But what should this be? Should you Tweet, Buzz, be LinkedIn, start a Blog, have a Wiki*, or maybe just get freaked out by all of this technology and move to the wilderness somewhere and never speak to anyone ever again?

I think the answer here is simple: start with a website. This is the simplest way to present yourself to the outside world and you can (and should) start one as soon as you begin graduate school. The website can be very simple. All you need is a homepage of some kind, a page providing more detailed descriptions of your research interests, a CV, a page listing your publications†, and a page with your contact information. Keep this updated and looking decent and you’ll have as good an online presence as most academics.

While putting together your own website might seem a little intimidating, it’s actually very easy these days. The simplest approach is to use one of the really easy hosted solutions out there. These include things like Google Sites, which is specifically designed to let you make websites; or you can easily turn a hosted blogging system into a website (WordPress.com is often used for this). There are lots of other good options out there (let us know about your favorites in the comments). In addition, many universities have some sort of system set up for letting you easily make websites; just ask around. Alternatively, you can get a static .html based template and then add your own content to it. Open Source Web Design is the best place I’ve found for templates. You can either open up the actual html files or you can use a WYSIWYG editor to replace the sample text with your own content. SeaMonkey is a good option for a WYSIWYG editor. Just ask your IT folks how to get these files up on the web when you’re done.

So, setting up a website is easy, but should you be doing other things as well, and if so, what? At the moment I would say that if you’re interested in trying out a new mode of academic communication then you should pick one that sounds like fun to you and give it a try; but this is by no means a necessity as an academic at the moment. If you do try to do some of these other things, then do them in moderation. It’s easy to get caught up in the rapid rewards of finishing a blog post or posting a tweet on Twitter, not to mention keeping up with others’ blogs and tweets, but this stuff can rapidly eat up your day and for the foreseeable future you won’t be getting a job based on your awesome stream of 140 character or less insights.

*Yep, that’s right, it’s a link to the Wikipedia page on wikis.
†And links to copies of them if you are comfortable flouting the absurd copyright/licensing policies of many of the academic publishers (or if you only published in open access journals).

Postdoctoral position in macroecology, quantitative ecology, and ecoinformatics

We have a postdoc position available for someone interested in the general areas of macroecology, quantitative ecology, and ecoinformatics. Here’s the short ad with links to the full job description:

Ethan White’s lab at Utah State University is looking for a postdoc to collaborate on research studying approaches for unifying macroecological patterns (e.g., species abundance distributions and species-area relationships) and predicting variation in these patterns using ecological and environmental variables. The project aims to 1) evaluate the performance of models that link ecological patterns by using broad scale data on at least three major taxonomic groups (birds, plants, and mammals); and 2) combine models with ecological and environmental factors to explain continental scale variation in community structure. Models to be explored include maximum entropy models, neutral models, fractal based models, and statistical models. The postdoc will also be involved in an ecoinformatics initiative developing tools to facilitate the use of existing ecological data. There will be ample opportunity for independent and collaborative research in related areas of macroecology, community ecology, theoretical ecology, and ecoinformatics. The postdoc will benefit from interactions with researchers in Dr. White’s lab, the Weecology Interdisciplinary Research Group, and with Dr. John Harte’s lab at the University of California Berkeley. Applicants from a variety of backgrounds including ecology, mathematics, statistics, physics and computer science are encouraged to apply. The position is available for 1 year with the possibility for renewal depending on performance, and could begin as early as September 2010 and no later than May 2011. Applications will begin to be considered starting on September 1, 2010. Go to the USU job page to see the full advertisement and to apply.

If you’re interested in the position and are planning to be at ESA please leave a comment or drop me an email (ethan.white@usu.edu) and we can try to set up a time to talk while we’re in Pittsburgh. Questions about the position and expressions of interest are also welcome.

UPDATE: This position has been filled.

Rise of the neoFisherian statistical paradigm

I’ve been meaning to get around to posting about Stuart Hurlbert and Cecilia Lombardi’s recent paper (2009; Ann. Zool. Fennici 46: 311–349) on the use of p-values in drawing scientific conclusions… but thankfully Jarrett Byrnes over at i’m a chordata! urochordata! wrote such a great post about it that all I need to do is point you over to his place. Just so you know what you’re getting into, Hurlbert & Lombardi provide a convincing argument against the sanctity of the canonical alpha value of 0.05 and against the use of alpha values and ‘statistically significant’ in general. Instead they recommend (quoting Jarrett):

1) Report a p-value for a test. 2) Do not assign it significance, but rather refer to the level of support it gives for rejecting a null – strong, weak, moderate, practically non-existent. Make sure this statement of support is grounded in the design and power of the experiment. Suspend judgement on rejecting a null if the p value is high, as p-value testing is NOT the same as giving evidence FOR a null (something so many of us forget). 3) Use this in accumulation with other lines of evidence to draw a conclusion about a research hypothesis.

Go check out the full post. It’s well worth the read.

Frequency distributions for ecologists V: Don’t let the lack of a perfect tool prevent you from asking interesting questions

I had an interesting conversation with someone the other day that made me think I needed one last frequency distribution post, so that nobody is discouraged from moving forward with interesting questions.

As a quantitative ecologist I spend a fair amount of time trying to figure out the best way to do things. In other words, I often want to know what the best available method is for answering a particular question. When I think I’ve figured this out I (sometimes, if I have the energy) try to communicate the best methodology more broadly to encourage good practice and accurate answers to questions of interest to ecologists. In some cases finding the best approach is fairly easy. For example, likelihood based methods for fitting and comparing simple frequency distributions are often straightforward and can be easily looked up online. However, in many cases the methodological challenges are more substantial, or the question being asked is not general enough for the methods to have been worked out and clearly presented. This happens in the case of frequency distributions when one needs non-standard minimum and maximum values (a common case in ecological studies) or when one needs discrete analogs of traditionally continuous distributions. It’s not that these cases can’t be addressed, it’s just that you can’t look the solutions up on Wikipedia.

So, what is someone without a sufficient background to do (and, btw, that might be all of us if the problem is really hard or even… intractable)? First, I’d recommend trying to ask for help. Talk to a statistician at your university or a quantitative colleague and see if they can help you figure things out. I am always pleased to try to help out because I always learn something in the process. Then, if that fails, just do something. Morgan and I will probably write more about this later, but please, please, please don’t let the questions you ask as an ecologist be defined by the availability of an ideal statistical methodology that is easy to implement. In the context of the current series of posts, if you are trying to do something with a more complex frequency distribution and you can’t find a solution to your problem using likelihood then use something else. If it were me I’d go with either normalized logarithmic binning or something based on the CDF, as these methods can behave reasonably well. Sure, people like me may complain, but that’s fine. Just make clear that you are aware of the potential weaknesses and that you did what you did because you couldn’t figure out an appropriate alternative approach. That way you still get to make progress on the question of interest and you may motivate people to help work on developing better methods. Sure, you might not be presenting the “right” answer, but then I very much doubt that we ever are when studying ecological systems anyway.

Frequency distributions for ecologists

This is a table of contents of sorts for five posts on the visualization, fitting, and comparison of frequency distributions. The goal of these posts is to expose ecologists to the ideas and language related to good statistical practices for addressing frequency distribution data. The focus is on simple distributions and likelihood methods. The information provided here is far from comprehensive, but my aim is to give readers a good place to start when exploring this kind of data.

Frequency distributions for ecologists IV: comparing model performance

Summary

Likelihood, likelihood, likelihood (and maybe some other complicated approaches), but definitely not r^2 values from fitting regressions to binned data.

A bit more nitty gritty detail

In addition to causing issues with parameter estimation, binning based methods are also inappropriate when trying to determine which distribution provides the best fit to empirical data. As a result you won’t find any card carrying statistician recommending this approach. Basically, binning and fitting regressions ignores the very nature of this kind of data, generating bizarre error structures and making measures of model fit arbitrary and ungrounded in statistical theory. This isn’t something that is controversial in any way. It is not “hotly contested” or open to debate despite what you may read in the ecological literature (e.g., Reynolds & Rhodes 2009), and you can’t (well, at least you shouldn’t) choose to use binning based methods just because someone else did (e.g., Maestre & Escudero 2009)*.

So, to be rigorous you want to use a more appropriate framework, which again should be likelihood (or Bayes; or something more complicated that I know nothing about; but if you’re taking the time to read this article you should probably start with likelihood). To determine the likelihood of a model given the data you simply take the product of the probability density function (pdf) evaluated at each value of x (put each value of x into the equation for the pdf with the parameters set to the maximum likelihood estimates and then multiply all of the resulting values together). Having done this you can use a likelihood ratio test to compare two distributions (if you’re into p-values this is for you) or you can use model selection based on an information criterion like AIC. With the likelihood in hand the AIC is then just 2k-2*ln(likelihood) (where k is the number of parameters). In practice you’ll probably want to calculate the ln(likelihood) to start (otherwise the values get really small and you’ll run into precision issues) so you would typically take the sum of the log of the pdf instead of the product described above. Andy Edwards’ 2007 Nature paper does a nice job of talking about this in the context of Levy Flights. It’s worth keeping in mind that the details can have an important influence here, so you’ll want to be sure that your pdfs have appropriately defined minimum and maximum values and satisfy the other limitations on the parameters as well. This approach will yield valid AICs for comparing models. In contrast, the AIC values in another recent Nature paper (which are based on binning the data, fitting regressions, and then estimating the likelihoods of those regressions) are not grounded in probability in the same way and in my opinion are not appropriate (at least without some Monte Carlo work to show that they perform well).
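To make the recipe above concrete, here is a minimal Python sketch of the log-likelihood and AIC calculation. Everything specific in it is my own assumption for illustration: the data are simulated rather than real, the two candidate models are a continuous power law and an exponential (both defined only for x at or above a known minimum, xmin), and the function names are made up. Treat it as a sketch of the idea, not a general-purpose tool.

```python
import math
import random


def loglik_power_law(x, alpha, xmin):
    """Sum of ln(pdf) for a continuous power law defined on [xmin, infinity)."""
    return sum(math.log((alpha - 1) / xmin) - alpha * math.log(xi / xmin) for xi in x)


def loglik_exponential(x, lam, xmin):
    """Sum of ln(pdf) for an exponential shifted so that it starts at xmin."""
    return sum(math.log(lam) - lam * (xi - xmin) for xi in x)


def aic(loglik, k):
    """AIC = 2k - 2*ln(likelihood), where k is the number of fitted parameters."""
    return 2 * k - 2 * loglik


# Simulated data (an assumption, not real observations): 500 draws from a
# power law with exponent 2.5 and xmin = 1, generated by inverse-CDF sampling.
random.seed(1)
xmin = 1.0
x = [xmin * (1 - random.random()) ** (-1 / (2.5 - 1)) for _ in range(500)]
n = len(x)

# Closed-form maximum likelihood estimates for each model.
alpha_hat = 1 + n / sum(math.log(xi / xmin) for xi in x)
lam_hat = 1 / (sum(x) / n - xmin)

# Each model has one free parameter, so k = 1 in both cases.
aic_pl = aic(loglik_power_law(x, alpha_hat, xmin), k=1)
aic_exp = aic(loglik_exponential(x, lam_hat, xmin), k=1)

print("power law:   alpha =", round(alpha_hat, 3), " AIC =", round(aic_pl, 1))
print("exponential: lambda =", round(lam_hat, 3), " AIC =", round(aic_exp, 1))
```

The model with the lower AIC is the better supported of the two. Note that both pdfs are normalized over the same range (x ≥ xmin), which is exactly the kind of detail about minimum and maximum values that matters for the comparison to be valid.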

I’m not trying to give anyone a hard time about what they’ve done in the past. There really is a failure of education and discussion regarding how to deal with distributions in ecology. That said, now that the discussion of these issues has started to reach the broad ecological population we do need to be careful about unnecessarily and inappropriately fomenting a statistical controversy that doesn’t exist, so that we can move towards the use and refinement of the most rigorous methods available.

Further reading

If you’re looking for a good introduction to this area I highly recommend The Ecological Detective by Hilborn & Mangel. If you’re looking for something with more advanced material and technical detail I like In All Likelihood by Pawitan. I’ve also heard good things about Benjamin Bolker’s new book, but I have not yet read it myself.

*NB: I haven’t conducted any Monte Carlo work on this myself like I have for parameter estimation, but I have read quite a bit of statistical literature in this area and if you do the same I think you will find that statisticians don’t even consider the possibility of binning and fitting regressions, because it is so obviously disconnected from the question at hand.

Frequency distributions for ecologists III: Fitting model parameters

Summary

Don’t bin your data and fit a regression. Don’t use the CDF and fit a regression. Use maximum likelihood or other statistically grounded approaches that can typically be looked up on Wikipedia.

A bit more detail

OK, so you’ve visualized your data and after playing around a bit you have an idea of what the basic functional form of the model is. Now you want to estimate the parameters. So, for example, you’ve log-transformed the axes and your distribution is approximately a straight line so you think it’s a power-law and you want to estimate the exponent of the power-law (more about figuring out if this is actually the best model for your data in the next and final installment).

Ecologists typically fall back on what they know from an introductory statistics class or two to solve this problem – regression. They bin the data, count how many points occur for each binned set of x values and use those numbers for the y data and the bin center for the x data. They then fit a regression to these points to estimate parameters. You can’t blame us as a group for this approach because we typically don’t receive training in the proper methods and we’re just doing our best. However, this approach is not valid and can yield very poor parameter estimates in some cases.

The best approach actually varies a bit depending on the specific form of the function, but generally what you want to do is maximum likelihood estimation (MLE). This approach basically determines the values of the free parameters under which the observed data are most probable. Technically what you really want is actually the minimum variance unbiased estimator (MVUE), which will often be a slightly modified form of the maximum likelihood estimate. The bad news is that most ecologists won’t want to derive the MLE themselves. The good news is that equations for calculating the MLE (or the MVUE) are readily available, and once you have the equation this approach is actually far less work than binning and fitting lines to data. For most common distributions you can just google the name of the distribution. The first link will typically be the Wikipedia entry. Just go to the Wikipedia page, scroll down to Parameter estimation and use the equation provided (and yes, Wikipedia is quite reliable for this kind of information). Even if the best approach isn’t maximum likelihood this will typically provide you with the best approach (or at least a very good one). If you’re doing something more complicated there is an excellent series of reference books by Norman Johnson and colleagues that covers parameter estimation for most characterized distributions in detail. Some of these solutions will require the use of numerical methods. This is fairly straightforward in R and Matlab, but if that’s a bit much you should be able to find a statistician or quantitatively inclined ecological colleague to help (they’ll be impressed that you got as far as you did).
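To show how little work the likelihood route can be, here is a rough Python sketch for the power-law example. It assumes a continuous power law with a known minimum value xmin, for which the maximum likelihood estimate of the exponent has a simple closed form (see the recommended reading below). The data are simulated, and the function names and the choice of linear bins of width 1 are my own assumptions, included only so the two approaches can be run side by side.

```python
import math
import random


def power_law_mle(x, xmin):
    """Closed-form MLE of the exponent of a continuous power law on [xmin, infinity)."""
    n = len(x)
    return 1 + n / sum(math.log(xi / xmin) for xi in x)


def binned_regression_exponent(x, xmin, bin_width):
    """The approach argued against above: linear bins, then a log-log regression."""
    counts = {}
    for xi in x:
        b = int((xi - xmin) // bin_width)
        counts[b] = counts.get(b, 0) + 1
    # Use the bin centers as x and the counts as y, both log-transformed.
    # Empty bins never enter the dictionary, so they are silently dropped.
    lx = [math.log10(xmin + (b + 0.5) * bin_width) for b in counts]
    ly = [math.log10(c) for c in counts.values()]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    return -slope


# Simulated data (an assumption): 1000 draws from a power law with exponent 2.5.
random.seed(42)
xmin = 1.0
x = [xmin * (1 - random.random()) ** (-1 / (2.5 - 1)) for _ in range(1000)]

alpha_mle = power_law_mle(x, xmin)
alpha_binned = binned_regression_exponent(x, xmin, bin_width=1.0)

print("true exponent:              2.5")
print("maximum likelihood:        ", round(alpha_mle, 3))
print("binned regression estimate:", round(alpha_binned, 3))
```

The maximum likelihood version is two lines and has no tuning choices beyond xmin, whereas the binned version needs a bin width and a decision about what to do with empty bins. Try changing bin_width to see how much the regression estimate moves around.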

Recommended reading (focused on power-laws because that’s what I’m most familiar with)

Newman et al. 2005, Edwards et al. 2007, White et al. 2008, Clauset et al. 2009

Frequency distributions for ecologists IIb: CDFs and Kernel density estimates

Beyond simple histograms there are two basic methods for visualizing frequency distributions.

Kernel density estimation is basically a generalization of the idea behind histograms. The basic idea is to put a miniature distribution (e.g., a normal distribution) at the position of each individual data point and then add up those distributions to get an estimate of the frequency distribution. This is a well developed field with a number of advanced methods for estimating the true form of the underlying frequency distribution. Silverman’s 1986 book is an excellent starting point for those looking for details. I like KDE as a general approach. It has the same issues as binning in that depending on the kernel width (equivalent to bin width) you can get different impressions of the data, but it avoids the issue of having to choose the position of bin edges. It also estimates a continuous distribution, which is what we are typically looking for.
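The "miniature distribution on every data point" idea can be written out in a few lines of Python. This is only a bare-bones sketch: it assumes a normal (Gaussian) kernel and a kernel width that you pick by hand, the function name kde_gaussian and the data values are made up for illustration, and it uses none of the more advanced bandwidth-selection methods mentioned above.

```python
import math


def kde_gaussian(data, bandwidth):
    """Return a function that estimates the density at any x.

    A normal distribution with standard deviation `bandwidth` is centered on
    each data point, and the n curves are summed and divided by n so that the
    total area under the estimate is 1.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


# Made-up example data; the two bandwidths show how kernel width changes the picture.
data = [1.2, 1.9, 2.1, 2.4, 3.8, 4.0, 5.5]
narrow = kde_gaussian(data, bandwidth=0.2)
wide = kde_gaussian(data, bandwidth=1.0)

for x in [1.0, 2.0, 3.0, 4.0, 5.0]:
    print(x, round(narrow(x), 3), round(wide(x), 3))
```

With the narrow kernel the estimate has a separate bump near almost every observation, while the wide kernel smooths them into a single broad hump. That is the same trade-off as choosing a bin width, just without any bin edges to position.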

Cumulative distribution functions (CDF for short) characterize the probability that a randomly drawn value of the x-variable will be less than or equal to a certain value of x. The CDF is conveniently the integral of the probability density function, so if you understand the CDF then simply taking its derivative will give you the frequency distribution. The CDF is nice because there are no choices that need to be made to construct a CDF for empirical data. No bin widths, no kernel widths, nothing. All you have to do is rank the n observed values of x from smallest to largest (i=1 . . . n). The probability that an observation is less than or equal to x (the CDF) is then estimated as i/n. If all of this seems a bit esoteric, just think about the rank-abundance distribution. This is basically just a CDF of the abundance distribution with the axes flipped (and one of them rescaled). Because of the objectivity of this approach it has recently been suggested that this is the best approach to visualizing species-abundance distributions. I’ve spent a fair bit of time working with CDFs for exactly this reason. The problem that I have run into with this approach is that, except in certain straightforward situations, it can be difficult/counterintuitive to grasp what a given CDF translates into with respect to the frequency distribution of interest.

To construct the CDF, first rank the n observed values (x_i) from smallest to largest (i = 1 . . . n). The probability that an observation is less than or equal to x_i (the CDF) is then estimated as i/n.
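The rank-and-divide recipe for the empirical CDF is short enough to write out in full. Here is a minimal Python sketch; the function name empirical_cdf and the abundance values are made up for illustration.

```python
def empirical_cdf(values):
    """Return (x_i, i/n) pairs, with the values ranked from smallest to largest."""
    xs = sorted(values)
    n = len(xs)
    return [(x, i / n) for i, x in enumerate(xs, start=1)]


# Made-up abundances for eight species.
abundances = [3, 1, 12, 7, 1, 45, 2, 5]
for x, p in empirical_cdf(abundances):
    print(x, p)
```

Plotting the second column against the first gives the CDF. One thing to be aware of in this simple version is that tied values (the two species with abundance 1 here) each get their own rank, so the largest i/n among the ties is the one that actually estimates the probability of being less than or equal to that value.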

The take home message for visualization is that any of the available approaches done well and properly understood is perfectly satisfactory for visualizing frequency distribution data.
