Some meandering thoughts on the difference between EcologicalData and DataONE

In the comments of my post on the Ecological Data Wiki Jarrett Byrnes asked an excellent question:

Very cool. I’m curious, how do you think this will compare/contrast/fight with the Data One project, or is this a different beast altogether?

As I started to answer it I realized that my thoughts on the matter were better served by a full post, both because they are a bit lengthy and because I don’t actually know much about DataONE and would love to have some of their folks come by, correct my mistaken impressions, and just chat about this stuff in general.

To begin with I should say that I’m still trying to figure this out myself, both because I’m still figuring out exactly what DataONE is going to be, and because EcologicalData is still evolving. I think that both projects’ goals could be largely defined as “Organizing Ecology’s Data,” but that’s a pretty difficult task, involving a lot of components and a lot of different ways to tackle them. So, my general perspective is that the more folks we have trying the merrier. I suspect there will be plenty of room for multiple related projects, but I’d be just as happy (even happier probably) if we could eventually find a single centralized location for handling all of this. All I want is a solution to the challenge.

But, to get to the question at hand, here are the differences I see based on my current understanding of DataONE:

1. Approach. There are currently two major paradigms for organizing large amounts of information. The first is to figure out a way to tell computers how to do it for us (e.g., Google); the second is to crowdsource its development and curation (e.g., Wikipedia). DataONE is taking the computer-based approach. It’s heavy on metadata, ontologies, etc. The goal is to manage the complexities of ecological data by providing the computer with very detailed descriptions of the data that it can understand. We’re taking the human approach, keeping things simple and trying to leverage the collective knowledge and effort of the field. As part of this difference in approach I suspect that EcologicalData will be much more interactive and community driven (the goal is for the community to actually run the site, just like Wikipedia) whereas DataONE will tend to be more centralized and hierarchical. I honestly couldn’t tell you which will turn out better (perhaps the two approaches will each turn out to be better for different things) but I’m really glad that we’re trying both at the same time to figure out what will work and where their relative strengths might be.

2. Actually serving data. DataONE will do this; we won’t. This is part of the difference in approach. If the computer can handle all of the thinking with respect to the data then you want it to do that and just spit out what you want. Centralizing the distribution of heterogeneous data is a really complicated task and I’m excited the folks at DataONE are tackling the challenge.

a. One of the other challenges for serving data is that you have to get all of the folks who “own” the data to let you provide it. This is one of the reasons I came up with the Data Wiki idea. By serving as a portal it helps circumvent the challenges of getting all of the individual stakeholders to agree to participate.

b. We do provide a tool for data acquisition, the EcoData Retriever, that likewise focuses on circumventing the need to negotiate with data providers by allowing each individual investigator to automatically download the data from the source. But, this just sets up each dataset independently, whereas I’m presuming that DataONE will let you just run one big query of all the data (which I’m totally looking forward to by the way) [1].

3. Focus. The primary motivation behind the Data Wiki goes beyond identifying datasets and really focuses on how you should use them. Having worked with other folks’ data for a number of years I can say that the biggest challenge (for me anyway) is actually figuring out all of the details of when and how the dataset should be used. This isn’t just a question of reading metadata either. It’s a question of integrating thoughts and approaches from across the literature. What I would like to see on the Data Wiki pages is the development of concise descriptions of how to use these datasets in the best way possible. This is a very difficult task to automate and one where I think a crowdsourced solution is likely the most effective. We haven’t done a great job of this yet, but Allen Hurlbert and I have some plans to develop a couple of good examples early in the fall to help demonstrate the idea.

4. We’re open for business. Ha ha, eat our dust DataONE. But seriously, we’ve taken a super simple approach which means we can get up and running quickly. DataONE is doing something much more complicated and so things may take some time to roll out. I’m hoping to get a better idea of what their timelines look like at ESA. I’m sure their tools will be well worth the wait.

5. Oh, and their budget is a little over $2,000,000/year, which is just slightly larger than our budget of around $5,000/year.

So, there is my lengthy and meandering response to Jarrett’s question. I’m looking forward to chatting with DataONE folks at ESA to find out more about what they are up to, and I’d love to have them stop by here to chat and clear up my presumably numerous misconceptions.


[1] Though we do have some ideas for managing something somewhat similar, so stay tuned for EcoData Retriever 2.0. Hopefully coming to an internet near you sometime this spring.

2 Comments

  1. Hi Ethan,
    We’ll be glad to give you and others a tour of current and future DataONE capabilities at booth 605 during the ESA.
    A few summary responses to your post / questions. Yes, we are taking a centralized computing approach; however, we do currently engage the broader community, though not currently in your Wiki-type example. We have a DataONE Users Group that provides feedback on the CI and resource development of the organization (interested in joining?); we have a dozen or so Working Groups populated by members of the ecological, computing, library, and sociological communities; and we run workshops inviting participants to contribute to the tools and resources that DataONE is building. I agree that some of this may not be readily transparent on our website, but that too is undergoing some major development in time for public release.
    Which brings me to timeline. Many of our educational / outreach resources are currently available online (e.g., our Best Practices and Software Tools database) or will soon be available online (e.g., our collaboration with the DMPTOOL, beta testing at ESA, full production of v1 in September). As such, we are already providing information on the ‘how’ (and ‘why’) of data management and reuse. However, I recognize that people want to know when they can go in and conduct centralized searches, and the answer to that is ‘by the end of the year’. We do not have a specific date for release but ‘the end of the year’ does not mean December 31st.
    It seems to me that one of the primary differences between the Ecological Data Wiki and DataONE is in serving up the data, as you describe it. The Ecological Data Wiki is going to be a great resource in identifying data sets and data repositories, with fewer resources invested in the search and integration capabilities? Perhaps not surprising given the differences in budget? DataONE will also provide a database of known repositories, but we will provide a centralized search interface across repositories, with increasing data integration and analysis capabilities in future versions of production. Additionally, by creating a coordinated network of Member Nodes (data repositories among them) we aim to provide secure and persistent access to data (if a repository server goes down, we will be able to get you the data from a replicated location).
    Hopefully this answers some of your questions, or perhaps elicits more.
    See you next week,
    Amber Budden
    Director for Community Engagement and Outreach, DataONE

  2. I think another big difference between DataOne and EcologicalData is ease of use.

    I worked on one of the precursors to DataOne, the HydroServer project (some of the people involved in designing HydroServer are now working on DataOne). HydroServer works (and if I’m not mistaken, DataOne will as well) by allowing organizations to create their own servers that broadcast data, accessible by web services. Two example servers run software I developed for HydroServer.

    Basically, to add data to the DataOne network, you’ll need to set up your own server, set up the DataOne software, and open your server to the public, dealing with the associated hardware costs and security issues and hiring server administrators and software developers along the way. To use the data, you’ll install and use DataOne software tools. For the types of data already available via the EcoData Retriever, this approach would be serious overkill. With the EcoData Retriever, access is provided to data that already exists, and once you have access to that data in a local database, you can access it any number of ways.

    DataOne is creating a new, somewhat complex infrastructure for organizations to publish data in a common format; EcologicalData takes a lot of data that’s already available in various different formats and simply makes them accessible.
