For the past few years I’ve been involved in a collaboration to put together a broad-coverage life history database for mammals, reptiles, and birds. The project started because my collaborator, Nathan Myhrvold, and I both had projects we were interested in that involved comparing life history traits of reptiles, mammals, and birds, and only mammals had easily accessible life history databases with broad taxonomic coverage. So, we decided to work together to fix this. To save others the hassle of redoing what we were doing, we decided to make the dataset available to the scientific community. While this post started out as a standard “Hey, check out this new publication from our group” post (Here it is, by the way: Myhrvold, N.P., †E. Baldridge, B. Chan, D. Sivam, D.L. Freeman, S.K.M. Ernest. 2015. An Amniote Life-history Database to Perform Comparative Analyses with Birds, Mammals, and Reptiles. Ecology 96:3109), I’ve realized that there’s something more important that needs to be discussed: what is the future of trait databases?
Trait databases are all the rage these days, for good reason. Traits are interesting from evolutionary and ecological perspectives: How and why do species differ in traits, how do traits evolve, how quickly do traits change in response to changing environment, and what impacts do these differences have on community assembly and ecosystem function? They have the potential to link individual performance with local, regional, and even global processes. There’s lots of trait data out there, but most of it has been buried in papers, books, theses, gray literature, field guides, etc. This has led to the explosion of compendiums compiling trait data. Some of these are published as Data Papers (e.g., Mammals: Jones et al. 2009; Plankton: Kremer et al. 2014) or on-line databases (e.g., AnAge, FishBase), which are open for everyone to use. Many of these open datasets are generated by a small number of scientists to address some particular question. Some are quasi-open/quasi-private resources generated by consortiums of scientists (TRY).
There are a variety of issues regarding these trait compendiums, not least of which is credit: the compendiums pull data from numerous sources, but how do data generators get credit, and what type of credit is reasonable? This is a doozy that I don’t have an answer to. Instead, my focus today is on the eventual endgame of trait databases. No trait database currently being produced has all the trait data of interest for every species. This means we have a bunch of incomplete data products running around. So, every few years, a bigger – more complete, but still incomplete – trait dataset is produced for some group of species. Sometimes the bigger dataset replicates the effort of the smaller one, sometimes it incorporates the smaller compilation whole-cloth, sometimes they have little overlap in sources whatsoever. Data compilations vary in ease of use and accessibility. Some databases are widely known, some are known only to a few insiders. I could keep going. Clearly this state of affairs is less than optimal for rapid progress in studying traits.
So what’s the end game here? What should we be doing? In my opinion, what we need is a centralized trait database where people can contribute trait data and where that data is easily accessible by anyone who wants to use it for research (not just to the contributing members of the database). It would also be nice if people who contribute significant amounts of data (no, I’m not going to define that here) could get specific credit for that contribution – maybe as a Data Paper or E-Publication. To encourage people to not just download data, add to it, and then sit on the expanded dataset, embargoes could be put in place to allow people to add their data to the dataset but have the data protected for a limited period of time to allow that researcher to get first crack at the publications using that entry. It’d be really nice if people who use the database could easily download all the references for the data they used so it can be easily incorporated into a literature cited section. The central database could get credit (let’s face it, it needs to be able to justify the funding that such an endeavor would require) by having people register papers published using data from the database. They could then keep track of numbers of pubs and citations to those pubs to help track the database’s impact.
Right about now, my Paleo brethren may be thinking “this sounds suspiciously familiar”. I’ve pretty much lifted this list right off of the Paleobiology Database website (https://paleobiodb.org/#/faq). While ecologists have been running our every-database-for-itself experiment on Trait Databases, the Paleobiologists have been experimenting with collaborative open databases for fossil records. I’m an outsider, so I don’t really know how the database is perceived within the paleo community, but from the outside I have been a big fan of the database, the work that has emerged from its existence, and the community that surrounds it. Which is why I’ve wondered if ecology could do something similar.
But if we’re going to do this, I think we need to copy something else from the Paleobiology Database: a focus on individual records. Currently, many trait databases focus on a species-level value; what is the average number of offspring per litter? Seed Mass? Average body size? This is a logical place to start building a database if many of the questions are focused on comparing central tendencies across species. But our understanding of traits and the questions we want to ask have evolved. Having any info is still better than no info, but often we need info on variability across individuals within a species or we want to know how the trait might vary with changes in the environment. For this, we need record-level data. By this, I mean that instead of pooling observations to obtain an average for a species, we now often want to know that the average litter size for a species at location X is 3 but 8 at location Y. For some species, traits are especially sensitive to temperature or some other environmental variable – so knowing if the body size was measured at 28C or 32C can be important. This data could then be summarized in whatever way the user needed (species-averages, region-specific averages, etc). This, of course, is the hard part, because while we have an increasing number of trait compilations, they have either jettisoned the record information, or little of the record info is associated with the datapoint except maybe the citation name (I say this knowing I’m guilty of this). It also involves doing some form of georeferencing if we want the location info to be useable (like they’ve been doing for museum records). This means we would need to basically uncompile the compilations – find the original citations, extract as much info as we can from them, and then re-enter them as part of a more sophisticated database. This is an extraordinary amount of work that (to be clear) I am not volunteering for.
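To make the record-level idea concrete, here is a minimal sketch in Python. The species names, field names, and values are all hypothetical, and the `summarize` helper is an illustration rather than part of any existing trait database. It only shows that if each observation keeps its location and measurement conditions, the user can choose the level of aggregation afterwards.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record-level observations: one row per measurement,
# not one row per species. All names and values are made up.
records = [
    {"species": "Sp. A", "location": "X", "temp_c": 28, "litter_size": 3},
    {"species": "Sp. A", "location": "X", "temp_c": 28, "litter_size": 3},
    {"species": "Sp. A", "location": "Y", "temp_c": 32, "litter_size": 8},
    {"species": "Sp. B", "location": "X", "temp_c": 28, "litter_size": 5},
]


def summarize(records, trait, by):
    """Average `trait` over records grouped by the fields named in `by`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in by)
        groups[key].append(rec[trait])
    return {key: mean(values) for key, values in groups.items()}


# The same records support a species-level average...
species_means = summarize(records, "litter_size", by=["species"])
# ...or a location-specific one, which a species-average table cannot give back.
site_means = summarize(records, "litter_size", by=["species", "location"])
```

With these made-up records, the species-level mean for "Sp. A" hides the fact that the two locations differ (3 at X versus 8 at Y); the record-level table preserves it.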
There are undoubtedly some in the trait community who are about to explode because they’ve been thinking “but we’re doing what you are talking about!”. There are indeed already some bigger initiatives out there (AnAge, FishBase, TRY) but they are either not community-based (i.e. run by a closed group), taxon-centric, or a nightmare of open and closed policies that make extracting data needlessly burdensome, or some unfortunate combo of the above. The one that seems closest to the Paleobiology Database model is TraitBank at the Encyclopedia of Life. Its goal, however, is different from the record-based trait database that I outlined above. Its goal is to have a webpage (and trait data) for every species on the planet, so this still seems to be a species average approach. As I mentioned before, some info is better than no info, so this alone would be a huge benefit to trait research, but still carries the restrictions of species-average values. On the plus side, data in the database is available for everyone to use and each data entry has the specific reference listed with it. But I don’t think it’s had broad buy-in from the trait community. TraitBank only lists 50 data sources and 327 “content partners” (websites/databases that have agreed to share their data via Encyclopedia of Life pages). Admittedly, these sources are some of the biggest data aggregations around, but it’s inconceivable that they cover the wide array of trait info for all of life. Without broad buy-in from the trait community, both using it for research and contributing their data to it, I don’t see this working in the way I’ve outlined above.
So where does this leave us? Well, things are currently in a muddle with respect to trait data, but there’s also tremendous opportunity for someone who can envision the type of database the field needs, sell broad swaths of the trait data community on its importance, and figure out how to build both the database and the community to support and use it. This may involve better community buy-in with TraitBank and/or some new initiative working on a record-level product that would allow a finer level of question to be asked. The question is how does this happen, and is there enough will in the trait community to give up on the current idiosyncratic ad hoc approach and contribute to something with broad trait and taxonomic coverage with an open data policy?
I agree with this; we don’t need trait databases, we need a single trait database. The problem is deletion: everyone thinks a new thing is great, but merging two into one and killing off one of the ancestors is the trick. I think this is an area where a stick approach is needed; people grumble about uploading data to GenBank, but we all do it b/c journals and now funders (NSF data management plan) basically require it, to the benefit of us all. If someone said, “My sequences will be uploaded to my own database on my website, rather than the standard GenBank” it presumably wouldn’t fly. Without deletion and merging, we end up with an XKCD standards issue: https://xkcd.com/927/ .
Nice point about the issue of making the previous version obsolete. It definitely stymies progress in this area. With the new life history database, I made one of my previous versions obsolete (Ernest 2003) and honestly, I expect someone to make this current version obsolete – maybe in a few years but maybe tomorrow. Who knows. This means we have a lot of duplication of effort that is a waste of time, energy, and resources. But how do you transition to a more communal state? GenBank is the stick, but the PBDB (PaleoBiology DataBase) is not. I’d love to know more about how that came about and how they generated community buy-in and also whether my perception of community buy-in is actually correct since I’m an outsider and could be missing something.
“I’d love to know more about how that came about and how they generated community buy-in and also whether my perception of community buy-in is actually correct since I’m an outsider and could be missing something.”
I’ll let some paleontologist older than me comment on those details, but I will note that the PBDB is largely an outgrowth of attempts by individuals or small groups of researchers, going back to the ’70s, to collect large amounts of data on fossil taxonomic occurrences from the literature. The PaleoDB was originally stitched together from several of those older, non-public databases. So, maybe decades of inertia are a factor here?
The PaleoDB itself is also more than a decade old. Many paleobiologists were initially very derisive of the PaleoDB and work produced from it, and many still are. Anecdotally, the reactions I get from other paleobiologists are very mixed: about 60% of those I talk to support it and use data from it frequently, and 40% despise it and will have nothing to do with data downloaded from it (but this probably mainly reflects the taxonomic groups I work on). So, while I’d say it does have a lot of community buy-in currently, I don’t think its battles are over.
I think one of the major factors promoting its use was a yearly workshop for graduate students on obtaining and analyzing PaleoDB data, and some of those participants have become important contributors, so the younger generation of paleobiologists has much more buy-in than the older generation.
It takes a core group of people who are willing to put in the time and effort to populate the database with information. More than half of the occurrences in the PBDB were contributed by five people (as well as their students and paid assistants). It especially takes one person (John Alroy for the PBDB) who is dedicated and will drive the project through all of the ups and downs until it becomes established, which may take years. In general, people won’t contribute out of the goodness of their heart. Tangible benefits, like credit or even co-authorship if large amounts of their contributions are used, are a good reward. The PBDB also offered data analysis tools (paleogeographic mapping of occurrences, statistical methods for diversity calculation) that might otherwise have been beyond the capabilities of an individual data contributor.
Paleontologists certainly complain about errors and omissions in the PBDB (welcome to the world of compilations made by humans!) but I’d say that most research uses PBDB data when it is relevant to the topic (diversity, extinction, biogeography, etc.). Occasionally people will compile their own small database on a specific topic, but that seems to be getting less common. There are people who will never use the information because they feel it is too inaccurate, but they also tend to do research where the database isn’t that useful anyways. Data accuracy/completeness is definitely the major challenge once a database is up and running.
Thanks for this post; it speaks my mind on many points. For some reason, up to now, databases often seem to be thought up and compiled with rather short-term goals, including the one of “How can I get the most possible co-authorships out of this?”. Of course it is difficult, costly, and very time-consuming to collect and organize large amounts of data, but we have to let go of this idea that somebody “owns” data if we really want to see through the promise made by trait-based ecology, i.e. to understand and predict species distributions across scales. That traitbases should start picking up on the fact that intraspecific variability matters is another very important point. Of course this means that databases become ever bigger and more complex, but it’s not like we lack the technical capabilities to achieve this. One additional point would be that we need free and cross-taxa data not only on traits, but also on distributions, from small-scale local samples up to large-scale grid data and distributional maps.
Thanks for your post and for the awesome database. I think this is spot on, particularly the emphasis on individual records over species values. One challenge that comes to mind is what to do with all the information in floras and faunas (and species descriptions). They are tremendous resources and are, at least for plants, a source for many of the data points in current trait databases. It is often unclear how many specimens have been measured for any given record. Additionally, for some types of traits, like plant height, it is useful to have a large number of records to have a good estimate of the mean and variance; but for others, such as woody vs. herbaceous, one record is (usually!) enough. I was wondering if you had any thoughts about how to accommodate such “records” in your ideal record-based database.
Thanks again for all your work collating and curating this data.
Awesome database and awesome post!
You’re pretty much preaching to the choir here. But I agree strongly about the need to go individual level.
I know past discussions around master trait databases center around whether to have a strong typology on which traits are included (i.e. a list) or to let people add traits as they have them. The first eliminates a lot of data and is restricted by the imagination of the database authors. The second approach pretty quickly leads to chaos finding all the instances of a trait. Semantics challenges are hard too – is metabolism measured by VO2 the same as metabolism measured by doubly labelled isotopes? And is the definition of “basal” or “max” metabolic rate comparable across different sources?
Not to take away from John Alroy’s (and others’) enormous contributions to paleobiodb, but I think the paleo community has had a stronger proclivity to build collective databases for decades. My speculation is that this is because it is a smaller community. Generally altruism is easier to build in smaller communities.
Thanks everyone for commenting! I’ll take them in order:
@dwbapst & Clapham: Thanks for providing some PBDB insights. Seems like 2 key components: 1) a core motivated group dedicated to getting it off the ground and established, 2) providing opportunities to do things that would be difficult for the average scientist in the field to do without the communal resource. I dithered over whether getting buy-in specifically from the younger generation was a critical 3rd component. I think it might be, because these communal databases are a huge culture shift, so the younger generation is the one most likely to embrace and use something so novel.
@lars: the view of data ownership in ecology is definitely a major issue, and not just for trait databases. I purposely skated around that one because it wasn’t a debate I felt like having (again). But until we get that sorted out, there will be a lot of tension and impediments in the field over this.
@pennell & lars: When I decided to mention the individual-level trait data, I honestly wasn’t sure if anyone would see any point to that since it’s just not something we’ve been doing or thinking about. So I’m glad I’m not alone in thinking this would open up a lot of different questions. As for, how many records would we need? I guess my response is “everything we can get our hands on”. There are traits with a lot of individual/population variability, traits that have little to no individual/population variability, and traits we assume lots about but have no idea what their actual variability is. But I think you could use the info coming in from the individual records as they are entered to help assess which traits need more individual/population level data and which seem to have found their central tendency.
More replies coming shortly.
@pennell: As for how would I actually deal with the more fixed traits on a record basis? Someone with more sophisticated database experience probably should weigh in here. My naive thought was that if there are truly traits that are fixed at the species-level (i.e. no variability in value) then that is stored in a separate table in the database from the individual records. They could be linked during data extraction in a relational database through species codes, or used in queries to subset or group the individual records. But like I said, I have experience with relational databases and SQL but I am not an expert in serious database design – which I think an individual record approach would entail.
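The two-table idea above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table names (`species_traits`, `trait_records`), column names, species codes, and values are all hypothetical, and a real design would need far more thought about citations, units, and taxonomy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: species-level (fixed) traits live in one table,
# individual records in another, linked by a species code.
conn.executescript("""
CREATE TABLE species_traits (
    species_code TEXT PRIMARY KEY,
    growth_form  TEXT              -- e.g. woody vs. herbaceous
);
CREATE TABLE trait_records (
    record_id    INTEGER PRIMARY KEY,
    species_code TEXT REFERENCES species_traits(species_code),
    trait        TEXT,
    value        REAL,
    location     TEXT,
    citation     TEXT
);
""")

conn.executemany(
    "INSERT INTO species_traits VALUES (?, ?)",
    [("SPA", "woody"), ("SPB", "herbaceous")],
)
conn.executemany(
    "INSERT INTO trait_records (species_code, trait, value, location, citation) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("SPA", "height_m", 10.0, "X", "made-up source 1"),
        ("SPA", "height_m", 14.0, "Y", "made-up source 2"),
        ("SPB", "height_m", 0.5, "X", "made-up source 1"),
    ],
)

# Use the fixed species-level trait to group the individual records.
rows = conn.execute("""
    SELECT s.growth_form, AVG(r.value)
    FROM trait_records AS r
    JOIN species_traits AS s ON s.species_code = r.species_code
    WHERE r.trait = 'height_m'
    GROUP BY s.growth_form
    ORDER BY s.growth_form
""").fetchall()
```

Here the fixed trait is stored once per species, while the variable trait keeps one row per record with its location and citation, so either table can be used to subset or group the other.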
@McGill: Your metabolic example was actually on my mind when I was writing this post, but I decided that I would just steer clear of the complications of implementing my dream. One – potentially horrific but draconianly pragmatic – way of dealing with these types of issues is by letting people enter the data they have but then converting it to what we want to store in the database. This is uncontroversial for unit conversions, but what if I want to enter a resting metabolic rate but we only store basals? Do we try to create a statistical relationship that allows us to “convert” the resting to basal and then provide that along w/ the actual measured basal metabolic rates? What if that statistical relationship is controversial or has a low r²? For our Amniote database, we did unit conversions but we deliberately did not convert lengths to masses (even though we really wanted the masses). Instead we provided the data in whatever format we had, which is why we have separate length and mass columns, and decided that if someone really wanted to convert the length to a mass, they could decide for themselves how they were comfortable doing it. But I was involved in another database (not yet published) where we really needed mass but the only way to get mass was often through the use of allometric relationships. If that one ever gets published, it will be the estimated masses that are provided because otherwise it would be too complicated for most people to be able to do on their own. I’ve got more hair-raising complications as well, but no sense depressing people. The truth is none of the hair-raising complications are fatal, but I think it will require some serious people giving these matters serious thought. Which goes back to the points by Clapham and dwbapst: it really needs a motivated core.
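One way to keep the distinction between uncontroversial unit conversions and model-based estimates visible is to store how each value was obtained alongside the value itself. The Python sketch below is an assumption-laden illustration, not how the Amniote database or any other database does it: the `MassValue` type, the `mass_from_record` function, and the allometric coefficients `a` and `b` (for mass = a × length^b) are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

G_PER_KG = 1000.0


@dataclass
class MassValue:
    grams: float
    method: str  # "measured" or "estimated_from_length"


def mass_from_record(
    mass_kg: Optional[float],
    length_cm: Optional[float],
    a: float,
    b: float,
) -> Optional[MassValue]:
    """Return a mass in grams, flagged with how it was obtained.

    A measured mass only needs a unit conversion. If only a length is
    available, fall back to a hypothetical allometric relationship
    mass_g = a * length_cm ** b and flag the result as an estimate.
    """
    if mass_kg is not None:
        return MassValue(mass_kg * G_PER_KG, "measured")
    if length_cm is not None:
        return MassValue(a * length_cm ** b, "estimated_from_length")
    return None


# Placeholder coefficients, chosen only to make the example run.
measured = mass_from_record(1.5, None, a=0.01, b=3.0)
estimated = mass_from_record(None, 10.0, a=0.01, b=3.0)
```

A user who distrusts the allometric relationship can then filter on `method` and keep only measured values, rather than having estimates silently mixed in.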
As an outsider, I agree that the paleo community seems to be further along on this whole cooperation track than we “neo” ecologists are. Perhaps it’s a community size thing, but community is partly a definition issue as well. Ecology is large, but the trait community is smaller and many of us are clearly running into the same frustrations and road blocks. In my mind there are two burning questions: a) is there a core group of individuals in the trait community who are motivated to tackle this and b) are there enough frustrated trait ecologists out there to create a cooperating community to participate in it?
Call for an RCN proposal in 3…2…1…?
You have hit on all of the critical points that I can think of, and I’m deeply enmeshed in them as I try to build an Avian Diet Database based on raw diet data of individuals or populations instead of species level declarations of “insectivore/omnivore”. Doing so makes it about 20 times slower in my estimation, and the only way I can imagine this getting built for a large biota is by getting other dedicated individuals involved which takes networking, workshops, some funding, etc.
One nice example of a useful way forward is the globalbioticinteractions.org project which absorbs a wide range of datasets in a variety of formats and uses ontologies to make sure these disparate sources can be ingested, integrated, compared, etc. This means that predator-prey, plant-pollinator, host-parasite, and all manner of biotic interactions and their associated data can all be stored and then evaluated in a common framework. So there is some hope that with funding for one or a handful of talented computational ecologists/software engineers something similar could be developed that would work for the vast majority of trait data. But even with the good will and support of dozens (scores?) of ecologists with data (or time) to contribute, I think it would take a small grant plus a dedicated Alroy-equivalent to build the groundwork for the Next Step.
Ah, the Paleobiology Database: we began in a dark, smoke filled pub in Liverpool with a dream…. wait, sorry, that was something else. We started in a bright, smoke-free hotel near NCEAS in August of 1998 in a meeting organized by John Alroy and Charles Marshall. We set off with a very particular goal in mind: compiling occurrence (incidence) data of fossil species to reassess historical diversity patterns after taking into account temporal variation in opportunities to find fossils. Part of what made the PaleoDB work was that John and Charles got a wide range of research programs involved: we had people focused on diversity dynamics, but also people focused on the effects of sampling for biostratigraphy, paleoecology, functional biology and even phylogenetics. Another part that made it work was that John & Charles brought in a wide career range: yes, there were a lot of youngsters at the first meeting (yes, we were young in 1998!) but we also had veterans such as David Raup, Jack Sepkoski (both sadly now deceased) and Richard Bambach (alive and kicking!) and mid-career types such as David Jablonski and Scott Lidgard. And, of course, John and Charles both are pretty damned brilliant themselves: that definitely helped fuel the “vision.”
The fact that we began with a stated goal (reassessing the classic Sepkoski diversity curves after accounting for variable sampling) could have been limiting. However, either by luck or due to the collection of people, it wound up having the opposite effect. We spent our first meeting going through all of the different types of fields we would want: and between the lot of us we came up with reasons to include information about abundances, body sizes, taxonomic histories, rock types, inferred environmental types, etc. As a result, PaleoDB research almost immediately snowballed into methodological studies about the many things that affect sampled richness beyond numbers of species and numbers of localities, the effect of taxonomic practices, etc., and (of course) myriad analyses that used the PaleoDB to test hypotheses and make inferences that had nothing (or little) to do with the classic Sepkoski historical diversity curves.
And that leads to the #1 lesson that I would say we have to offer anyone trying to do something like this: if for any reason someone thinks a data type could be relevant to a general question, then set up fields to include that info: even if it never is relevant to “the” question, then it is going to be relevant to a related question. And lesson #1B is, don’t just get the experts on “the” question: get people doing related stuff that might look at the question differently and/or who are using similar data for very different questions. That gives you the opportunity to serendipitously spin off really cool projects.
@hurlbert: Allen, was that you offering to lead an RCN on this? That is awesome! Thanks for stepping up… 🙂 I’ll have to look more at globalbioticinteractions.org. Seems like there are two main avenues for data collection: entry of data by data holders and scraping of data from published sources. Most of my work has been on the (manual) scraping end of things, but more automated scraping would be great. Did you and Ethan just do a bunch of data scraping from PDFs? I seem to remember a lot of regular expressions a few weeks ago (and a few colorful ones as well).
@Wagner: Pete, thanks for the great insights into the beginning of the PBDB. I had no idea that the PBDB started with specific research questions in mind – though that makes sense, and you’re right: in many ways that makes it remarkable that the group was able to keep the database so flexible. I like your lessons:
1 “if for any reason someone thinks a data type could be relevant to a general question, then set up fields to include that info: even if it never is relevant to “the” question, then it is going to be relevant to a related question” Every time I’ve done a compilation I’ve pondered and rejected a range of data fields to add to the database and I have always regretted that decision.
1B: “don’t just get the experts on “the” question: get people doing related stuff that might look at the question differently and/or who are using similar data for very different questions” I was just in a conversation bemoaning a specific dataset and why they treated some important data in a manner that made it hard to work with, and the answer was clear: they didn’t have anyone involved who understood why anyone would want/need that data.
This is from Jens Kattge and Gerhard Boenisch, the TRY database managers.
The blog and follow-up posts criticize, amongst others, the TRY initiative (https://www.try-db.org). However, at least with respect to TRY, the critique seems to be based on incorrect or outdated information. We would therefore like to take this opportunity to provide a better picture of the TRY database.
One point, however, is true: TRY only contains plant trait data. The point is, this is our expertise and the data in TRY are highly curated (species, all traits with more than 1000 entries and important auxiliary data are standardized; duplicates and outliers are flagged, etc.). This kind of curation only makes sense if you know what you are doing. We are, however, trying to become the one-stop source for plant trait data, although it is naive to believe it will ever be possible for one database to contain all plant trait data.
TRY does not restrict the data it accepts: any trait or auxiliary data is accepted, currently more than 1100 traits and 2000 kinds of auxiliary data. The data in TRY are based on individuals and the relation between traits measured on the same individual is conserved.
The data do not belong to the TRY initiative, but remain in possession of the providers who decide how their data are made available. We consider this the best balance between their rights and those of the data users. And this is the only way to get many of the data. The TRY database currently contains 5.6 million trait records for 100,000 species. This would not be possible without strong support from the plant trait community.
And last: Since 12/2014 EVERYONE has the right to obtain data from TRY. Most of the trait records are publicly available. We have already released data for more than 1000 requests, most of them in 2015, which included more than 100 million trait records. 63 publications have used trait data from TRY already.
Hi Morgan; let me be contrary for a moment [maybe longer]. A bird species may have a clutch size, or maybe a clutch size matched to some identified ‘environmental variables’ [is latitude a variable?], but a fish species, with indeterminate growth, does not have a life history. It’s worse, since simply recording variation in a fish species’ LH tells one nothing. Nothing.
Rainbow trout, the ‘royalty’ of fish if any species is, ranges from 5″ length at maturity to maybe 25″, depending upon ???????. That is a 125X range difference in body mass at maturity [25/5 cubed], and since every rate process in biology is body-mass dependent, to even speak of a rainbow trout LH is nonsense. Understanding the LH variation for Rainbows is not limited by absence of a central database, but by lack of understanding of ?????? and how ????? forces the life history. (There is a not-dumb idea that variation in body size growth potential controlled by the environment plays a big role, and this must be part of it. But what the heck is the environment?)
It makes no sense to me to discuss whether between-population variation in LH ‘traits’ should be put in a database without thinking very deeply about what the variation might mean. And listing oodles of ‘environmental’ variables will not help. Not even for data exploration.
A recent paper by a big consortium of scientists looked at phylogeny of fish, using some very fancy stat models [way beyond me], and concluded that change in body size went along with speciation in fish. They used FISHBASE to assign rainbows, and indeed each species, a single length. A single length! Of course by doing so they just assumed away the within-species, between-population variation in body size. Go figure.
@ Jens and Gerhard,
Thanks for stopping by and telling us more about TRY. However, I think you read more into my post than I was trying to convey. My post is not a critique of TRY but is pointing out what I think would be a productive route forward for trait databases in general as they continue to evolve. If there is a critique implied in the post it is aimed at all of us who have generated trait databases because we all fall short of where I think we need to go (myself included). There are many things about the TRY database that are in line with where I think trait databases need to go – curation, individual-based entries, community involvement of experts, etc. My post does not say (nor imply) anywhere that TRY does not have these things.
I did point out that TRY is not fully open data (“quasi-open/quasi-private” is my quote). Open data is a specific concept (https://en.wikipedia.org/wiki/Open_data) which TRY does not meet. The role of data in science is contentious, and TRY made a specific decision on this issue that some people agree with and others (like myself) do not. TRY has become more open over the years, which is something I and many other people applaud and I hope is a trend that continues because TRY is an important endeavor but one that I think will not reach its full potential until the open data issue is resolved. But that’s just my opinion.
@ a correspondent
Thanks for your comments on the issues of life history and indeterminate growers. It is actually exactly that issue that brought me to my realization that we need to move away from species averages and towards a database of individual records where we can link important info like: clutch size: 4, when maternal size: 40 cm, and temperature: 24C. Only when we have that info for a bunch of populations do I think we can really get a handle on the within-species sensitivity of the life history and the actual variance it displays in nature. Clearly data alone is not the answer. We also need theoretical development (or very deep thinking – I love that statement by the way). But appropriate data is not only a key component of the “develop theory, test it, rethink theory” process; for some of us it is also a jumping-off point for the deep thinking. I think the species-level approach hinders that, as you point out. This is why I think integrating individual/population level data into one place (in a way where we can still work with the data at the individual/population level) will improve our ability to ask and answer important questions.