Some meandering thoughts on the difference between EcologicalData.org and DataONE
In the comments of my post on the Ecological Data Wiki, Jarrett Byrnes asked an excellent question:
Very cool. I’m curious, how do you think this will compare/contrast/fight with the Data One project – https://www.dataone.org/ – or is this a different beast altogether?
As I started to answer it I realized that my thoughts on the matter were better served by a full post, both because they are a bit lengthy and because I don’t actually know much about DataONE and would love to have some of their folks come by, correct my mistaken impressions, and just chat about this stuff in general.
To begin with, I should say that I’m still trying to figure this out myself, both because I’m still figuring out exactly what DataONE is going to be, and because EcologicalData is still evolving. I think that both projects’ goals could be largely defined as “Organizing Ecology’s Data,” but that’s a pretty difficult task, involving a lot of components and a lot of different ways to tackle them. So, my general perspective is that the more folks we have trying, the merrier. I suspect there will be plenty of room for multiple related projects, but I’d be just as happy (even happier probably) if we could eventually find a single centralized location for handling all of this. All I want is a solution to the challenge.
But, to get to the question at hand, here are the differences I see based on my current understanding of DataONE:
1. Approach. There are currently two major paradigms for organizing large amounts of information. The first is to figure out a way to tell computers how to do it for us (e.g., Google); the second is to crowdsource its development and curation (e.g., Wikipedia). DataONE is taking the computer-based approach. It’s heavy on metadata, ontologies, etc. The goal is to manage the complexities of ecological data by providing the computer with very detailed descriptions of the data that it can understand. We’re taking the human approach, keeping things simple and trying to leverage the collective knowledge and effort of the field. As part of this difference in approach I suspect that EcologicalData will be much more interactive and community-driven (the goal is for the community to actually run the site, just like Wikipedia) whereas DataONE will tend to be more centralized and hierarchical. I honestly couldn’t tell you which will turn out better (perhaps the two approaches will each turn out to be better for different things) but I’m really glad that we’re trying both at the same time to figure out what will work and where their relative strengths might be.
2. Actually serving data. DataONE will do this; we won’t. This is part of the difference in approach. If the computer can handle all of the thinking with respect to the data then you want it to do that and just spit out what you want. Centralizing the distribution of heterogeneous data is a really complicated task and I’m excited the folks at DataONE are tackling the challenge.
a. One of the other challenges for serving data is that you have to get all of the folks who “own” the data to let you provide it. This is one of the reasons I came up with the Data Wiki idea. By serving as a portal, the wiki helps circumvent the challenge of getting all of the individual stakeholders to agree to participate.
b. We do provide a tool for data acquisition, the EcoData Retriever, that likewise focuses on circumventing the need to negotiate with data providers by allowing each individual investigator to automatically download the data from the source. But, this just sets up each dataset independently, whereas I’m presuming that DataONE will let you just run one big query of all the data (which I’m totally looking forward to by the way) [1].
3. Focus. The primary motivation behind the Data Wiki goes beyond identifying datasets and really focuses on how you should use them. Having worked with other folks’ data for a number of years I can say that the biggest challenge (for me anyway) is actually figuring out all of the details of when and how the dataset should be used. This isn’t just a question of reading metadata either. It’s a question of integrating thoughts and approaches from across the literature. What I would like to see develop on the Data Wiki pages is a set of concise descriptions of how to go about using these datasets in the best way possible. This is a very difficult task to automate and one where I think a crowdsourced solution is likely the most effective. We haven’t done a great job of this yet, but Allen Hurlbert and I have some plans to develop a couple of good examples early in the fall to help demonstrate the idea.
4. We’re open for business. Ha ha, eat our dust DataONE. But seriously, we’ve taken a super simple approach, which means we can get up and running quickly. DataONE is doing something much more complicated and so things may take some time to roll out. I’m hoping to get a better idea of what their timelines look like at ESA. I’m sure their tools will be well worth the wait.
5. Oh, and their budget is a little over $2,000,000/year, which is just slightly larger than our budget of around $5,000/year.
So, there is my lengthy and meandering response to Jarrett’s question. I’m looking forward to chatting with DataONE folks at ESA to find out more about what they are up to, and I’d love to have them stop by here to chat and clear up my presumably numerous misconceptions.
——————————————————————————————————————-
[1] Though we do have some ideas for managing something somewhat similar, so stay tuned for EcoData Retriever 2.0. Hopefully coming to an internet near you sometime this spring.