[Preprint] Nine simple ways to make it easier to (re)use your data

I’m a big fan of preprints, the posting of papers in public archives prior to peer review. Preprints speed up the scientific dialogue by letting everyone see research as it happens, not 6 months to 2 years later following the sometimes extensive peer review process. They also allow more extensive pre-publication peer review because input can be solicited from the entire community of scientists, not just two or three individuals. You can read more about the value of preprints in our preprint about preprints (yes, really) posted on figshare.

In the spirit of using preprints to facilitate broad pre-publication peer review, a group of weecologists has just posted a preprint on how to make it easier to reuse data that is shared publicly. Since PeerJ’s commenting system isn’t live yet, we would like to encourage you to provide feedback about the paper here in the comments. It’s for a special section of Ideas in Ecology and Evolution on data sharing (something else I’m a big fan of) that is being organized by Karthik Ram (someone I’m a big fan of).

Our nine recommendations are:

  1. Share your data
  2. Provide metadata
  3. Provide an unprocessed form of the data
  4. Use standard data formats (including file formats, table structures, and cell contents)
  5. Use good null values
  6. Make it easy to combine your data with other datasets
  7. Perform basic quality control
  8. Use an established repository
  9. Use an established and liberal license

Most of this territory has been covered before by a number of folks in the data sharing world, but if you look at the state of most ecological and evolutionary data it clearly bears repeating. In addition, I think that our unique contribution is threefold: 1) We’ve tried hard to stick to relatively simple things that don’t require a huge time commitment to get right; 2) We’ve tried to minimize the jargon and really communicate with the awesome folks who are collecting great data but don’t have much formal background in the best practices of structuring and sharing data; and 3) We contribute the perspective of folks who spend a lot of time working with other people’s data and have therefore encountered many of the most common issues that crop up in ecological and evolutionary data.

So, if you have the time, energy, and inclination, please read the preprint and let us know what you think and what we can do to improve the paper in the comments section.

UPDATE: This manuscript was written in the open on GitHub. You can also feel free to file GitHub issues if that’s more your style.

UPDATE 2: PeerJ has now enabled commenting on preprints, so comments are welcome directly on our preprint as well (https://peerj.com/preprints/7/).

15 Comments on “[Preprint] Nine simple ways to make it easier to (re)use your data”

  1. Great manuscript! I think it’s written well and flows nicely. Hopefully, this gets well circulated (in preprint and final forms) because it seems like data sharing is really underappreciated outside of genetics and the LTER network.

    I don’t have a ton of experience with archiving or using archived data, so I just have a few brief thoughts on the manuscript.
    1) I felt like the first point on the importance of making your data available could just be part of the introduction, leaving you with 8 points. It’s really a matter of preference and maybe it’s good to have as its own point to emphasize it since introductions often just get skimmed.

    2) It might be good to provide a suggestion, even just a sentence, about how to break up tables. You mention linking tables (main field data with latitude, longitude, and weather/climate), but data can be broken up in a variety of ways. Does it make sense to have as much as possible in one table and only break things into related tables when otherwise impractical (or collected at different spatial/temporal scales than the primary data)? For example, lat and long could be included next to each record or linked with another table so lat and long are just listed for each unique location. The latter makes sense for lat and long but there are probably a lot of gray areas. Maybe it doesn’t matter as long as it’s well described in the metadata.

    3) On line 129 you mention MS Excel format. Excel files can be saved as .csv files. It might be worth putting “(.xls, .xlsx)” parenthetically after Excel so people don’t think that Excel files saved as .csv are a major problem.

    4) In point 6 it might be useful to state or diagram your recommended format for handling species names.

    Thanks for sharing as a preprint and good luck moving forward.
    -Dan

  2. Thanks for the feedback Dan! You’ve actually hit on a couple of points that we really struggled with.

    1. Whether point 1 should be its own point or a part of the introduction was discussed several times. In the end we decided that having it as its own point better emphasized the importance of sharing data and helped us avoid discouraging people from sharing if the following recommendations seemed too difficult. That said, we’ll definitely go back and reconsider this again when revising.

    2. We absolutely agree that this is an important point. The proper general rule from database normalization is to break things up as much as possible to avoid storing and entering duplicate information. So lats and longs should be broken out into a sites table, taxonomic information should be broken out into a species table, etc. This was something we initially planned on adding to the format section and may have even had some text on, but we decided that because not splitting things up didn’t really make it harder to reuse data in most cases, it was a level of complexity that we could afford to leave out. However, your comment has me rethinking this and we’ll definitely chat about whether it’s worth adding a few sentences to cover this.

    3. Good point. We’ll definitely add that.

    4. Another excellent point. Having spent a bit too much time parsing species names (with various levels of subspecies, hybrids, unknowns, etc.) out of single columns I really should have thought of that one. Consider it added!
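The splitting described in point 2 of this reply can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are made up for illustration; this is one possible layout under those assumptions, not the structure the manuscript prescribes.

```python
import sqlite3

# In-memory database; every table and column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (
        site_id   TEXT PRIMARY KEY,
        latitude  REAL,
        longitude REAL
    );
    CREATE TABLE observations (
        obs_id        INTEGER PRIMARY KEY,
        site_id       TEXT REFERENCES sites(site_id),
        species       TEXT,
        n_individuals INTEGER
    );
""")

# Each site's coordinates are entered exactly once (made-up values).
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?)",
    [("A", 31.938, -109.08), ("B", 31.95, -109.1)],
)
# Observations point at a site by id instead of repeating lat/long.
conn.executemany(
    "INSERT INTO observations (site_id, species, n_individuals) VALUES (?, ?, ?)",
    [("A", "species one", 3), ("A", "species two", 1), ("B", "species one", 5)],
)

# A join rebuilds the single wide table whenever it is needed.
rows = conn.execute("""
    SELECT o.species, o.n_individuals, s.latitude, s.longitude
    FROM observations AS o
    JOIN sites AS s ON s.site_id = o.site_id
    ORDER BY o.obs_id
""").fetchall()

for row in rows:
    print(row)
```

The point of the sketch is that a corrected coordinate only has to be changed in one row of the sites table, while the joined view still gives users the flat table they may prefer to analyze.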

  3. Under “Use good null values” I would add that you should pick a null value that is easily distinguishable from the table delimiter. I have seen tab separated tables where the null value was a space…
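As a rough illustration of the comment above, here is a small sketch using Python's standard csv module. The sample data, the column names, and the choice of "NA" as the null token are all assumptions made for the example; the idea is only that a token which cannot be confused with the tab delimiter is trivial to detect.

```python
import csv
import io

# Hypothetical tab-separated data; "NA" marks a missing value.
raw = "site\tmass_g\tlength_mm\nA\t12.5\tNA\nB\tNA\t40\n"

NULL_TOKEN = "NA"  # assumed convention; the real one belongs in the metadata


def parse_value(text):
    """Return None for the null token, otherwise the value as a float."""
    return None if text == NULL_TOKEN else float(text)


reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
records = [
    {
        "site": r["site"],
        "mass_g": parse_value(r["mass_g"]),
        "length_mm": parse_value(r["length_mm"]),
    }
    for r in reader
]
print(records)
```

Had the null value been a space or an empty field, it would be much harder to tell a missing measurement apart from a formatting accident in a whitespace-delimited file.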

  4. As a remote sensing/GIS/landscape ecology researcher, I am often faced with poorly reported location data, which can be a major source of desperation. These suggestions might be too narrow for your article, but feel free to use them if you’d like:

    1) PAY ATTENTION TO YOUR DATUM: the Datum is the theoretical, simplified 3-D representation of the surface of the Earth against which locations are determined. Every single geographical coordinate is measured in relation to a specific Datum, and if that information is omitted, your positional data is almost worthless. In this figure http://www.colorado.edu/geography/gcraft/notes/datum/gif/shift.gif, all points have the exact same coordinate, but notice how much their actual position changes if you change your reference (datum). It is nearly impossible to “guess” the datum for data that doesn’t have it reported, but it is very easy to convert data between datums when needed, if the information is present.

    So, make sure you know which datum your GPS is set to, and make note of it in your field book and your metadata. If you don’t know which one to pick, use WGS-84 (BEFORE recording the locations, of course).

    Later on, after you input the data into a GIS system, please don’t trust geospatial file formats (e.g. shapefiles) to carry on this information. Add a companion text file with metadata, and save yourself and others from future headaches.

    Also note that datum != projection. A map projection is defined by the set of equations with which you take a set of coordinates measured against a 3d ellipsoidal shape (the datum), and redistribute them at relative positions on a plane, ideally preserving some properties of the original data (distances, angles, areas, etc). Changing projections does not change the coordinate values, only their relative position on the plane. But changing datums WILL change your coordinates, for the same geographic location.

    2) Precision: a lot of people prefer to use decimal degrees over degrees:minutes:seconds, as it is easier to input in a spreadsheet. If you do it, be VERY careful with rounding. For example, a degree of latitude/longitude is ~111 km at the Equator. So if you want a 10 m precision (a realistic expectation for commercial handheld GPS receivers), you’ll need to record at least four decimal places. If you just write down 55.23°, you’re effectively rounding to the nearest km!

    For this reason, I often recommend using UTM coordinates for everything that is sampled at a local scale. UTM units are given in meters, not degrees, so by writing the full number you’re always sure you have enough precision. As a bonus, UTM coordinates from GPS receivers are usually referenced to the WGS84 datum, although UTM can be paired with other datums, so it is still worth noting which one you used.

    3) Reporting: if you don’t like UTM coordinates (or your study site is unfortunately spread across two UTM zones), and you want to play safe, you’ll write your coordinates in the field as DD° MM’ SS.S” (and make note of the datum!). This notation, however, is unintelligible for pretty much any software, unless you use string parsing. A much better way to report is to have separate columns in your data table, such as lat_deg, lat_min, lat_sec, lon_deg, lon_min, lon_sec. And don’t forget to use negative signs on the degree column, to indicate southern/western hemispheres (instead of N,S,W,E). If you have the data organized like this, you can easily convert it to decimal degrees before feeding it to a GIS or statistical package, using dec_lat = lat_deg + (lat_min/60) + (lat_sec/3600) when the degrees are positive (when they are negative, subtract the minutes and seconds terms instead), and the software will give you all the decimal places that you could ever want. No precision lost.
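The conversion and the precision argument in points 2) and 3) can be sketched in a few lines of Python. The function name dms_to_decimal is hypothetical, and the sketch assumes the convention suggested above: the sign lives on the degree column, and minutes and seconds are always recorded as positive numbers. One thing to watch for is that a plain sum only works for positive degrees, so the sign is applied to the whole coordinate here.

```python
import math


def dms_to_decimal(deg, minutes, seconds):
    """Convert degrees, minutes, seconds to decimal degrees.

    The sign is taken from the degree value and applied to the whole
    coordinate, so minutes and seconds are entered as positive numbers.
    """
    sign = math.copysign(1.0, deg)
    return sign * (abs(deg) + minutes / 60 + seconds / 3600)


# Northern hemisphere: identical to the plain sum.
north = dms_to_decimal(55, 13, 48.0)
# Southern hemisphere: a plain sum would land near -54.77 instead.
south = dms_to_decimal(-55, 13, 48.0)
print(north, south)

# Approximate ground distance represented by the last recorded decimal
# place, using the ~111 km per degree figure quoted in the comment.
KM_PER_DEGREE = 111.0
for places in range(0, 6):
    metres = KM_PER_DEGREE * 1000 / 10 ** places
    print(places, "decimal places ->", round(metres, 2), "m")
```

A remaining edge case under this convention is a coordinate between 0 and -1 degrees, since an integer 0 cannot carry a minus sign; a separate hemisphere or sign column would avoid that ambiguity.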

    I hope that helps!

  5. Great job on this! Just one comment – I suggest mentioning persistent identifiers other than DOIs. Although they are the most known and used, the identifier world is big. Folks should know about them as a concept – not just as DOIs.

  6. Really enjoyed the article Ethan, I think it’s extremely helpful to have this (relatively) simple stuff written down so clearly. One minor issue that you may want to consider concerns your recommendation of .csv as a good format for text files. I agree – and it’s the format I use all the time in my own work. However, I also collaborate with people in continental Europe, where commas are used as the decimal separator (i.e., 1.2 is represented as 1,2). This has caused problems before for me, especially when people are converting Excel tables into text files which I then read into something like R. It is easily addressed if you know it’s an issue, but if you’re unaware of it, it can cause headaches! So maybe just adding a sentence somewhere which says that you should state your decimal separator (at least, if it’s not ‘.’) may be useful.
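To make the decimal separator issue concrete, here is a minimal sketch using only Python's standard library. The sample file contents, the column names, and the helper parse_decimal_comma are invented for the example, and it assumes the common case where such exports use semicolons between fields because the comma is already taken by the decimals.

```python
import csv
import io

# Hypothetical export from a spreadsheet with European locale settings:
# semicolons separate the fields and commas mark the decimals.
raw = "site;mass_g\nA;1,2\nB;10,75\n"


def parse_decimal_comma(text):
    """Convert a string such as '1,2' to the float 1.2."""
    return float(text.replace(",", "."))


reader = csv.DictReader(io.StringIO(raw), delimiter=";")
masses = {row["site"]: parse_decimal_comma(row["mass_g"]) for row in reader}
print(masses)
```

Tools that many readers already use have switches for the same thing: pandas.read_csv takes sep and decimal arguments, and R provides read.csv2 for semicolon-separated files with decimal commas. Either way, the reader has to know which convention the file follows, which is why stating it in the metadata helps.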

    Would also agree on Daniel Hocking’s point re. taxonomy. I shudder to think how many hours I’ve spent matching species lists. And even when I thought I had the full taxonomy for each species (from Kingdom to Species), I’ve run into problems when someone else has used a non-standard group (Infraorder or whatever). But that’s taxonomists for you…

    Finally, I’ve just learnt a lot from thiagosilva’s comment that I should have known already!

  7. Wow! Thanks for the great feedback everyone. We really appreciate it. This demonstrates one of the most important things about preprints – the opportunity to get early feedback from a bunch of smart folks in an open way.

    Matt – Great point. Consider it done.

    Thiago – Getting into the specific issues of spatial data was another thing we went back and forth about and decided to leave out, but your comment has reminded me of how common an issue this is and its important consequences. I think that with a couple of additional sentences in Section 6 we should be able to include the basic message. Do you have a particularly good introductory level citation that you recommend for these issues?

    Carly – Great point. We should be able to tweak the last couple of sentences of Section 8 pretty easily to address it.

    Tom – Great point on the commas. I absolutely never would have thought of this.

  8. Ethan,

    Can’t think of an easy, accessible reference off the top of my head. Which is probably part of the problem; explanations of these concepts are buried in GIS and Cartography books, and usually are not approached at all in Ecology textbooks, leading to the assumption that geographic coordinates are unique, immutable units.

    I particularly like this book, though: http://goo.gl/w1AWC . It’s cheap, short, and easy to understand, and covers datums, projection, geoids, and the GPS system. It should be on the reference shelf of any lab that acquires spatial data.

  9. Nice Ethan.
    Tidbits: “data are”…
    Agree about commas = troublesome. CSV has come to mean generic delimited files rather than true comma-separated values in a lot of contexts, so recommend TABs.

  10. Thanks @beroe.

    I’m not convinced at this point that Tabs are superior delimiters. Tabs have their own issues: 1) they can display differently in different editors meaning that data that is readable by the person creating the data may be very difficult to view in the raw form by someone else; 2) It can be difficult to tell the difference between tab delimited data, space delimited data, and fixed width data; 3) most importers default to commas as the delimiter. You can get around all of these of course, but it’s its own set of hurdles, particularly for less experienced analysts. We’ll definitely caution against using commas with European style decimals, but I’d need more convincing that tabs are inherently superior.

  11. Really enjoyed following the development of this cool paper. I had never heard of GitHub before this. Been poking around. A post on using it to write a paper would be cool.

  12. Thanks John. I’ve added it to the list (and I’m really hopeful that with sabbatical coming up posts will start coming off of the list rather than just going on it). In the meantime I definitely recommend Karthik Ram’s recent paper on the value of Git for scientists, including some discussion of its value for writing papers.

  13. Pingback: British Ecological Society journals now allow preprints | Jabberwocky Ecology | Weecology's Blog
