Data Retriever 2.0: We handle the data so you can focus on the analysis
We are very excited to announce a major new release of the Data Retriever, our software for making it quick and easy to get clean, ready-to-analyze versions of publicly available data.
The Data Retriever automates the downloading, cleaning, and installing of ecological and environmental data into your choice of databases and flat file formats. Instead of spending hours tracking down the data on the web, downloading it, trying to import it, running into issues (e.g., non-standard nulls, problematic column names, encoding issues), fixing one problem, and then encountering the next, all you need to do is run a single command from the command line:
$ retriever install csv iris
$ retriever install sqlite breed-bird-survey -f bbs.sqlite
or from R:
> rdataretriever::install('postgres', 'wine-quality')
> portal_data <- rdataretriever::fetch('portal')
The Data Retriever uses information in Frictionless Data datapackage.json files to automatically handle all of the complexities of “simple” data for you. For more complicated datasets, with dozens of components or major data structure issues, the Retriever uses Python scripts as plugins to handle the major data cleaning work and then automatically handles the rest.
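As a rough sketch of what such a file looks like, the layout below follows the public Frictionless Data datapackage.json specification; the dataset name, URL, and fields are made up, and the Retriever's own scripts include additional Retriever-specific keys not shown here:

{
  "name": "example-dataset",
  "title": "An example dataset",
  "resources": [
    {
      "name": "main_table",
      "path": "http://example.com/data/main_table.csv",
      "schema": {
        "fields": [
          {"name": "site_id", "type": "string"},
          {"name": "year", "type": "integer"},
          {"name": "abundance", "type": "integer"}
        ]
      }
    }
  ]
}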
To find out more about the Data Retriever, check out the websites, the full documentation, and the GitHub repositories for both the Data Retriever and the R Data Retriever package.
Expanded focus and name change
For those of you familiar with the EcoData Retriever, this is the same software with a new name. Challenges with the data end of the analysis pipeline occur across disciplines and our tools work just as well for non-ecological data, so we’ve started adding non-ecological data and changed our name to reflect that. We’d love to hear from anyone interested in leading a push to add data from another discipline or just interested in adding a single favorite dataset.
As part of this we’ve changed the name of the R package from ecoretriever to rdataretriever.
Major changes
The 2.0 release includes a number of major changes, including:
- Python 3 support (a single code base runs on both Python 2 and 3)
- Adoption of the frictionless data datapackage.json standard (replacing our old YAML like metadata system), including a command line interface for creating and editing datapackage.json files
- Addition of JSON and XML as available output formats
- Major expansion of the documentation and hosting of the documentation at Read the Docs
- Removal of the graphical user interface (to allow us to focus that development time on wrappers for other languages)
- Lots of work under the hood and major improvements in testing
- A broadened scope that includes non-ecological data
We are also in the process of releasing version 1.0 of the R package. This version adds support for the new features in the Data Retriever and also includes major stability improvements, particularly in RStudio and on Windows.
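For ecoretriever users the switch mostly amounts to the new package name; here is a minimal sketch of the basic workflow, assuming the rdataretriever functions mirror the ecoretriever ones (datasets(), install(), fetch()) and that installation is from CRAN. The dataset names are just examples and the argument pattern follows the calls shown above:

install.packages('rdataretriever')
library(rdataretriever)

# List the datasets the Retriever knows about
rdataretriever::datasets()

# Install a dataset into csv files, following the same pattern as the examples above
rdataretriever::install('csv', 'iris')

# Install a dataset and load it into R as a list of tables
portal_data <- rdataretriever::fetch('portal')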
We also have a brand new website.
Upgrading to the new version (UPDATED)
To ensure the smoothest upgrade to the new version we recommend:
- Run retriever reset scripts from the command line
- Uninstall the old version of the EcoData Retriever
- Install the new version
- Run retriever update from the command line
Acknowledgments
Henry Senyondo is the lead developer for the Data Retriever and has done an amazing job over the past year developing new features and shoring up the fundamentals of the software. He led the work on 2.0 from start to finish.
Akash Goel was a Google Summer of Code student with the project last summer and was responsible for the majority of the work adding Python 3 support and switching the project over to the datapackage.json standard.
Dan McGlinn, the creator of the R package, has continued his excellent leadership of the development of this package. Shawn Taylor, a new contributor, was instrumental in solving the stability issues on Windows/RStudio.
In addition to these core folks, our growing group of contributors to both projects has been invaluable for adding new functionality, fixing bugs, and testing new changes. We are super excited to have contributions from 30 different people and will keep working hard to make sure that everyone feels welcome and supported in contributing to the project.
The level of work done to get these releases out the door was only possible due to the generous support of the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative. This support allowed my group to employ Henry as a full-time software engineer to work on these and other projects. This kind of active support for the development and maintenance of research-oriented software makes sustainable software development at universities possible.
New release of the EcoData Retriever
We are very excited to announce the newest release of the EcoData Retriever, our software for automating the downloading, cleaning, and installing of ecological and environmental data. Instead of spending hours or days trying to get complicated datasets like the Breeding Bird Survey ready for analysis, the Retriever lets you simply click a button or run a single command from R or the command line, and your computer does the rest.
It’s been over a year since the last retriever release and there are lots of new features and improvements to be excited about.
- We’ve added 21 new datasets, including major ecological and environmental datasets like eBird, Vertnet, and the Global Wood Density Database, and the PRISM climate data.
- To support all of these datasets we’ve added support for additional data types including greater than memory archive files, and we’ve also improved the ability to control where downloaded files are stored and how they are clustered together.
- We’ve significantly improved documentation and now have a new automatically built documentation site at Read The Docs.
- We’ve also made a lot of under the hood improvements.
This is also the first release that has been overseen by Weecology’s new software engineer, Henry Senyondo. We’re excited to have Henry on the team, and now that he’s around, development of both the EcoData Retriever and other lab software projects will be happening more quickly.
A big thanks to the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative for funding this development through Grant GBMF4563 and to the National Science Foundation for funding as part of a CAREER award to Ethan White.
UPDATE: Led by Dan McGlinn, we also released a new version of the ecoretriever R interface for the Retriever last fall. This makes using the Retriever from R as simple as:
data <- ecoretriever::fetch("BBS")
EcoData Retriever now supports R and environmental data, and has more datasets
We are very excited to announce the newest release of our EcoData Retriever software and the first release of a supporting R package, ecoretriever. If you’re not familiar with the EcoData Retriever you can read more here.
The biggest improvement to the Retriever in this set of releases is the ability to run it directly from R. Dan McGlinn did a great job leading the development of this package, and we got a ton of fantastic help from the folks at rOpenSci (most notably Scott Chamberlain, Gavin Simpson, and Karthik Ram). Now, once you install the main EcoData Retriever, you can run it from inside R by doing things like:
install.packages('ecoretriever')
library(ecoretriever)

# List the datasets available via the Retriever
ecoretriever::datasets()

# Install the Gentry dataset into csv files in your working directory
ecoretriever::install('Gentry', 'csv')

# Download the raw Gentry dataset files, without any processing,
# to the subdirectory named data
ecoretriever::download('Gentry', './data/')

# Install and load a dataset as a list
Gentry = ecoretriever::fetch('Gentry')
names(Gentry)
head(Gentry$counts)
The other big advance in this release is the ability to have the Retriever directly download files instead of processing them. This allows us to support data that doesn’t come in standard tabular forms. So, we can now include things like environmental data in GIS formats and phylogenetic data such as supertrees. We’ve used this new capability to allow the automatic downloading of the Bioclim data, one of the most widely used climate datasets in ecology, and the supertree for mammals from Fritz et al. 2009.
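For download-only datasets like these, the relevant call is ecoretriever::download() rather than install(); here is a quick sketch (the 'Bioclim' dataset name and target directory are illustrative, so check ecoretriever::datasets() for the exact name):

# Download the raw Bioclim climate layers, with no processing, into ./bioclim/
ecoretriever::download('Bioclim', './bioclim/')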
Finally, we’ve also add the very cool mammalian diet dataset from Dryad
EcoData Retriever: quickly download and clean up ecological data so you can get back to doing science
If you’ve every worked with scientific data, your own or someone elses, you know that you can end up spending a lot of time just cleaning up the data and getting it in a state that makes it ready for analysis. This involves everything from cleaning up non-standard nulls values to completely restructuring the data so that tools like R, Python, and database management systems (e.g., MS Access, PostgreSQL) know how to work with them. Doing this for one dataset can be a lot of work and if you work with a number of different databases like I do the time and energy can really take away from the time you have to actually do science.
Over the last few years Ben Morris and I have been working on a project called the EcoData Retriever to make this process easier and more repeatable for ecologists. With a click of a button, or a single call from the command line, the Retriever will download an ecological dataset, clean it up, restructure and assemble it (if necessary), and install it into your database management system of choice (including MS Access, PostgreSQL, MySQL, or SQLite) or provide you with CSV files to load into R, Python, or Excel.
Just click on the box for the dataset you want in the graphical interface, or run a command like this from the command line:
retriever install msaccess BBS --file myaccessdb.accdb
This means that instead of spending a couple of days wrangling a large dataset like the North American Breeding Bird Survey into a state where you can do some science, you just ask the Retriever to take care of it for you. If you work actively with Breeding Bird Survey data and you always like to use the most up to date version with the newest data and the latest error corrections, this can save you a couple of days a year. If you also work with some of the other complicated ecological datasets like Forest Inventory and Analysis and Alwyn Gentry’s Forest Transect data, the time savings can easily be a week.
The Retriever handles things like:
- Creating the underlying database structures
- Automatically determining delimiters and data types
- Downloading the data (and if there are over 100 data files that can be a lot of clicks)
- Transforming data into standard structures so that common tools in R and Python and relational database management systems know how to work with it (e.g., converting cross-tabulated data)
- Converting non-standard null values (e.g., 999.0, -999, NoData) into standard ones
- Combining multiple data files into single tables
- Placing all related tables in a single database or schema
The EcoData Retriever currently includes a number of large, openly available ecological datasets (see a full list here). It’s also easy to add new datasets to the EcoData Retriever if you want to. For simple data tables, a Retriever script can be as short as:
name: Name of the dataset
description: A brief description of the dataset of ~25 words.
shortname: A one word name for the dataset
table: MyTableName, http://awesomedatasource.com/dataset
The Retriever has an installer for Windows, an App for Mac, and a package for Ubuntu/Debian Linux. See the quick explanation of how to get started and then go take it for a spin.
If you’re interested in reading more about the Retriever you can checkout the website or read our paper on the project.
We also have some exciting new features on the To Do list including:
- Automatically cleaning up the taxonomy using existing services
- Providing detailed tracking of the provenance of your data by recording the date it was downloaded, the version of the software used, and information about what cleanup steps the Retriever performed
- Integration into R and Python
Let us know what you think we should work on next in the comments.
How I remember the positive in science and academia: My Why File
Doing science in academia involves a lot of rejection and negative feedback. Between grant agencies’ single-digit funding rates, pressure to publish in a few “top” journals, all of which have rejection rates of 90% or higher [1], and the growing gulf between the number of academic jobs and the number of graduate students and postdocs [2], spending even a small amount of time in academia pretty much guarantees that you’ll see a lot of rejection. In addition, even when things are going well we tend to focus on providing as much negative feedback as possible. Paper reviews, grant reviews, and most university evaluation and committee meetings are focused on the negatives. Even students with awesome projects that are progressing well and junior faculty who are cruising towards tenure have at least one meeting a year where someone in a position of power will try their best to enumerate all of the things you could be doing better [3]. This isn’t always a bad thing [4] and I’m sure it isn’t restricted to academia or science (these are just the worlds I know), but it does make keeping a positive attitude and a reasonable sense of self-worth a bit… challenging.
One of the things that I do to help me remember why I keep doing this is my Why File. It’s a file where I copy and paste reminders of the positive things that happen throughout the year [5]. These typically aren’t the sort of things that end up on my CV. I have my CV for tracking that sort of thing and frankly the number of papers I’ve published and grants I’ve received isn’t really what gets me out of bed in the morning. My Why File contains things like:
- Email from students in my courses, or comments on evaluations, telling me how much of an impact the skills they learned have had on their ability to do science
- Notes from my graduate students, postdocs, and undergraduate researchers thanking me for supporting them, inspiring them, or giving them good advice
- Positive feedback from mentors and people I respect that help remind me that I’m not an impostor
- Tweets from folks reaffirming that an issue or approach I’m advocating for is changing what they do or how they do it
- Pictures of thank you cards or creative things that people in my lab have done
- And even things that in a lot of ways are kind of silly, but that still make me smile, like screen shots of being retweeted by Jimmy Wales or of Tim O’Reilly plugging one of my papers.
If you’ve said something nice to me in the past few years be it in person, by email, on twitter, or in a handwritten note, there’s a good chance that it’s in my Why File helping me keep going at the end of a long week or a long day. And that’s the other key message of this post. We often don’t realize how important it is to say thanks to the folks who are having a positive influence on us from time to time. Or, maybe we feel uncomfortable doing so because we think these folks are so talented and awesome that they don’t need it, or won’t care, or might see this positive feedback as silly or disingenuous. Well, as Julio Betancourt once said, “You can’t hug your reprints”, so don’t be afraid to tell a mentor, a student, or a colleague when you think they’re doing a great job. You might just end up in their Why File.
What do you do to help you stay sane in academia, science, or any other job that regularly reminds you of how imperfect you really are?
[1] The idea that where you publish matters more than what you publish is a problem, but not the subject of this post.
[2] There are lots of great ways to use a PhD, but unfortunately not everyone takes that to heart.
[3] Of course the people doing this are (at least sometimes) doing so with the best intentions, but I personally think it would be surprisingly productive to just say, “You’re doing an awesome job. Keep it up.” every once in a while.
[4] There is often a goal to the negativity, e.g., helping a paper or person reach their maximum potential, but again I think we tend to undervalue the consequences of this negativity in terms of motivation [4b].
[4b] Hmm, apparently I should write a blog post on this since it now has two footnotes worth of material.
[5] I use a Markdown file, but a simple text file or a MS Word document would work just fine as well for most things.
Four basic skill areas for a macroecologist [Guest post]
This is a guest post by Elita Baldridge (@elitabaldridge), a graduate student in Ethan White’s lab in the Ecology Center at Utah State University.
As a budding macroecologist, I have thought a lot about what skills I need to acquire during my Ph.D. This is my model of the four basic attributes for a macroecologist, although I think it is more generally applicable to many ecologists as well:
- Data
- Statistics
- Math
- Programming
Data:
- Knowledge of SQL
- Dealing with proper database format and structure
- Finding data
- Appropriate treatments of data
- Understanding what good data are
Statistics:
- Bayesian
- Monte Carlo methods
- Maximum likelihood methods
- Power analysis
- etc.
Math:
- Higher level calculus
- Should be able to derive analytical solutions for problems
- Modelling
Programming:
- Should be able to write programs for analysis, not just simple statistics and simple graphs.
- Able to use version control
- Once you can program in one language, you should be able to program in other languages without much effort, but should be fluent in at least one language.
General recommendations:
Achieve expertise in at least 2 out of the 4 basic areas, but be able to communicate with people who have skills in the other areas. However, if you are good at collaboration and come up with really good questions, you can make up for skill deficiencies by collaborating with others who possess those skills. Start with smaller collaborations with the people in your lab, then expand outside your lab or increase the number of collaborators as your collaboration skills improve.
Gaining skills:
Achieving proficiency in an area is best done by using it for a project that you are interested in. The more you struggle with something, the better you understand it eventually, so working on a project is a better way to learn than trying to learn by completing exercises.
The attribute should be generalizable to other problems: For example, if you need to learn maximum likelihood for your project, you should understand how to apply it to other questions. If you need to run an SQL query to get data from one database, you should understand how to write an SQL query to get data from a different database.
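As a purely illustrative sketch in R, the same SELECT statement can be reused against different databases once you know the pattern; the DBI, RSQLite, and RPostgres packages are real, but the file, database, table, and column names below are made up:

library(DBI)

# The query itself doesn't care which database it runs against
query <- "SELECT species, COUNT(*) AS n FROM surveys GROUP BY species"

# Run it against a local SQLite file
con_sqlite <- dbConnect(RSQLite::SQLite(), "surveys.sqlite")
dbGetQuery(con_sqlite, query)
dbDisconnect(con_sqlite)

# Run the identical query against a PostgreSQL server; only the connection changes
con_pg <- dbConnect(RPostgres::Postgres(), dbname = "surveys_db")
dbGetQuery(con_pg, query)
dbDisconnect(con_pg)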
In graduate school:
Someone who wants to compile their own data or work with existing data sets needs to develop a good intuitive feel for data; even if they cannot write SQL code, they need to understand what good and bad databases look like and develop a good sense for questionable data, and how known issues with data could affect the appropriateness of data for a given question. The data skill is also useful if a student is collecting field data, because a little bit of thought before data collection goes a long way toward preventing problems later on.
A student who is getting a terminal master’s and is planning on using pre-existing data should probably be focusing on the data skill (because data is a highly marketable skill, and understanding data prevents major mistakes). If the data are not coming from a central database, like the BBS, where the quality of the data is known, additional time will be needed to compile the data, clean it, figure out whether it can be used responsibly, and fill holes in it.
Master’s students who want to go on for a Ph.D. should decide what questions they are interested in and should try to pick a project that focuses on learning a good skill that will give them a headstart- more empirical (programming or stats), more theoretical (math), more applied (math (e.g., for developing models), stats(e.g., applying pre-existing models and evaluating models, etc.), or programming (e.g. making tools for people to use)).
Ph.D. students need to figure out what types of questions they are interested in, and learn those skills that will allow them to answer those questions. Don’t learn a skill because it is trendy or you think it will help you get a job later if you don’t actually want to use that skill. Conversely, don’t shy away from learning a skill if it is essential for you to pursue the questions you are interested in.
Right now, as a Ph.D. student, I am specializing in data and programming. I speak enough math and stats that I can communicate with other scientists and learn the specific analytical techniques I need for a given project. For my interests (testing questions with large datasets), I think that by the time I am done with my Ph.D., I will have the skills I need to be fairly independent with my research.
[Preprint] Nine simple ways to make it easier to (re)use your data
I’m a big fan of preprints, the posting of papers in public archives prior to peer review. Preprints speed up the scientific dialogue by letting everyone see research as it happens, not 6 months to 2 years later following the sometimes extensive peer review process. They also allow more extensive pre-publication peer review because input can be solicited from the entire community of scientists, not just two or three individuals. You can read more about the value of preprints in our preprint about preprints (yes, really) posted on figshare.
In the spirit of using preprints to facilitate broad pre-publication peer review, a group of weecologists has just posted a preprint on how to make it easier to reuse data that is shared publicly. Since PeerJ‘s commenting system isn’t live yet, we would like to encourage you to provide feedback about the paper here in the comments. It’s for a special section of Ideas in Ecology and Evolution on data sharing (something else I’m a big fan of) that is being organized by Karthik Ram (someone I’m a big fan of).
Our nine recommendations are:
- Share your data
- Provide metadata
- Provide an unprocessed form of the data
- Use standard data formats (including file formats, table structures, and cell contents)
- Use good null values (see the short sketch after this list)
- Make it easy to combine your data with other datasets
- Perform basic quality control
- Use an established repository
- Use an established and liberal license
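As a small, purely illustrative R sketch of what recommendations 4 and 5 can look like in practice (the column names and values are made up, and whichever null indicator you choose should be documented in your metadata):

# One row per observation, simple column names, one value per cell
counts <- data.frame(
  site_id = c("A1", "A1", "B2"),
  species = c("DIPO", "PEMA", "DIPO"),
  count   = c(4, NA, 7)  # missing value stored as NA, not -999 or 999.0
)

# Plain CSV with a single, consistent null value
write.csv(counts, "counts.csv", row.names = FALSE, na = "NA")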
Most of this territory has been covered before by a number of folks in the data sharing world, but if you look at the state of most ecological and evolutionary data it clearly bears repeating. In addition, I think that our unique contribution is threefold: 1) We’ve tried hard to stick to relatively simple things that don’t require a huge time commitment to get right; 2) We’ve tried to minimize the jargon and really communicate with the awesome folks who are collecting great data but don’t have much formal background in the best practices of structuring and sharing data; and 3) We contribute the perspective of folks who spend a lot of time working with other people’s data and have therefore encountered many of the most common issues that crop up in ecological and evolutionary data.
So, if you have the time, energy, and inclination, please read the preprint and let us know what you think and what we can do to improve the paper in the comments section.
UPDATE: This manuscript was written in the open on GitHub. You can also feel free to file GitHub issues if that’s more your style.
UPDATE 2: PeerJ has now enabled commenting on preprints, so comments are welcome directly on our preprint as well (https://peerj.com/preprints/7/).
Some alternative advice on how to decide where to submit your paper
Over at Dynamic Ecology this morning Jeremy Fox has a post giving advice on how to decide where to submit a paper. It’s the same basic advice that I received when I started grad school almost 15 years ago and as a result I don’t think it considers some rather significant changes that have happened in academic publishing over the last decade and a half. So, I thought it would be constructive for folks to see an alternative viewpoint. Since this is really a response to Jeremy’s post, not a description of my process, I’m going to use his categories in the same order as the original post and offer my more… youthful… perspective.
- Aim as high as you reasonably can. The crux of Jeremy’s point is “if you’d prefer for more people to read and think highly of your paper, you should aim to publish it in a selective, internationally-leading journal.” From a practical perspective journal reputation used to be quite important. In the days before easy electronic access, good search algorithms, and social networking, most folks found papers by reading the table of contents of individual journals. In addition, before there was easy access to paper-level citation data and alt-metrics, if you needed to make a quick judgment on the quality of someone’s science the journal name was a decent starting point. But none of those things are true anymore. I use searches, filtered RSS feeds, Google Scholar’s recommendations, and social media to identify papers I want to read. I do still subscribe to tables of contents via RSS, but I watch PLOS ONE and PeerJ just as closely as Science and Nature. If I’m evaluating a CV as a member of a search committee or a tenure committee I’m interested in the response to your work, not where it is published, so in addition to looking at some of your papers I use citation data and alt-metrics related to your paper. To be sure, there are lots of folks like Jeremy that focus on where you publish to find papers and evaluate CVs, but it’s certainly not all of us.
- Don’t just go by journal prestige; consider “fit”. Again, this used to mater more before there were better ways to find papers of interest.
- How much will it cost? Definitely a valid concern, though my experience has been that waivers are typically easy to obtain. This is certainly true for PLOS ONE.
- How likely is the journal to send your paper out for external review? This is a strong tradeoff against Jeremy’s point about aiming high since “high impact” journals also typically have high pre-review rejection rates. I agree with Jeremy that wasting time in the review process is something to be avoided, but I’ll go into more detail on that below.
- Is the journal open access? I won’t get into the arguments for open access here, but it’s worth noting that increasing numbers of us value open access and think that it is important for science. We value open access publications so if you want us to “think highly of your paper” then putting it where it is OA helps. Open access can also be important if you “prefer for more people to read… your paper” because it makes it easier to actually do so. In contrast to Jeremy, I am more likely to read your paper if it is open access than if it is published in a “top” journal, and here’s why: I can do it easily. Yes, my university has access to all of the top journals in my field, but I often don’t read papers while I’m at work. I typically read papers in little bits of spare time while I’m at home in the morning or evenings, or on my phone or tablet while traveling or waiting for a meeting to start. If I click on a link to your paper and I hit a paywall then I have to decide whether it’s worth the extra effort to go to my library’s website, log in, and then find the paper again through that system. At this point unless the paper is obviously really important to my research the activation energy typically becomes too great (or I simply don’t have that extra couple of minutes) and I stop. This is one reason that my group publishes a lot using Reports in Ecology. It’s a nice compromise between being open access and still being in a well regarded journal.
- Does the journal evaluate papers only on technical soundness? The reason that many of us think this approach has some value is simple: it reduces the amount of time and energy spent trying to get perfectly good research published in the most highly ranked journal possible. This can actually be really important for younger researchers in terms of how many papers they produce at certain critical points in the career process. For example, I would estimate that the average amount of time that my group spends getting a paper into a high-profile journal is over a year. This is a combination of submitting to multiple, often equivalent-caliber, journals until you get the right roll of the dice on reviewers, and the typically extended rounds of review that are necessary not only to satisfy the reviewers about what you’ve done, but also to address requests for additional analyses that often aren’t critical and to change how things are described so that they sit better with reviewers. If you are finishing your PhD then having two or three papers published in a PLOS ONE-style journal vs. in review at a journal that filters on “importance” can make a big difference in the prospect of obtaining a postdoc. Having these same papers out for an extra year accumulating citations can make a big difference when applying for faculty positions or going up for tenure if folks who value paper-level metrics over journal name are involved in evaluating your packet.
- Is the journal part of a review cascade? I don’t actually know a lot of journals that do this, but I think it’s a good compromise between aiming high and not wasting a lot of time in review. This is why we think that ESA should have a review cascade to Ecosphere.
- Is it a society journal? I agree that this has value and it’s one of the reasons we continue to support American Naturalist and Ecology even though they aren’t quite as open as I would personally prefer.
- Have you had good experiences with the journal in the past? Sure.
- Is there anyone on the editorial board who’d be a good person to handle your paper? Having a sympathetic editor can certainly increase your chances of acceptance, so if you’re aiming high then having a well matched editor or two to recommend is definitely a benefit.
To be clear, there are still plenty of folks out there who approach the literature in exactly the way Jeremy does and I’m not suggesting that you ignore his advice. In fact, when advising my own students about these things I often actively consider and present Jeremy’s perspective. However, there are also an increasing number of folks who think like I do and who have a very different set of perspectives on these sorts of things. That makes life more difficult when strategizing over where to submit, but the truth is that the most important thing is to do the best science possible and publish it somewhere for the world to see. So, go forth, do interesting things, and don’t worry so much about the details.
UPDATE: More great discussion here, here, here and here. [If I missed yours just let me know in the comments and I’ll add it]
Why your science blog should provide full feeds
People find blog posts in different ways. Some visit the website regularly, some subscribe to email updates, and some subscribe using the blog’s feed. Feeds can be a huge time saver for processing the ever-increasing amount of information that science generates by placing much of that information in a single place in a simple, standardized format. They also let you consume one piece of information at a time and keep your inbox relatively free of clutter (for more about why using a feed reader is awesome see this post).
When setting up their feeds, bloggers can choose to either provide the entire content of each post or a teaser that contains just the first few sentences. In this post I am going to argue that science bloggers should choose to provide full posts.
The core reason is that we are doing this to facilitate scientific dialog, and we are all very busy. In addition to the usual academic workload of teaching, doing research, and helping our departments and universities function, we are now dealing with keeping up with a rapidly expanding literature plus a bloom of scientific blogs, tweets, and status updates (and oh yeah, some of us even have personal lives). This means that we are consuming a massive amount of information on a daily basis and we need to be able to do so quickly. I squeeze this in during small windows of time (bus rides home, gaps between meetings, while I’m running my toddler’s bath) and often on a mobile device.
I can do this easily if I have full feeds. I open my feed reader, open the first item, read it, move on to the next one. My brain knows exactly what format to expect, cognitive load is low, and the information is instantly available. If instead I encounter a teaser, I first have to make a conscious decision about whether or not I want to click through to the actual post, then I have to hit the link, wait for the page to load (which can still be a fairly long time on a phone), adjust to a format that varies widely across blogs, often adjust the zoom and rotate my screen (if I’m reading on my phone), read the item, and then return to my reader. This might not seem like a huge deal for a handful of items, but multiply the lost time by a few hundred or a few thousand items a week and it adds up in a hurry. On top of that I store and tag full-text, searchable, copies of posts for all of the blogs I follow in my feed reader so that I can find posts again. This is handy when I remember there is a post I want to either share with someone or link to, but can’t remember who wrote it.
So, if your blog doesn’t provide full feeds this means three things. First, I am less likely to read a post if it’s a teaser. It costs me extra time, so the threshold for how interesting it needs to be goes up. Second, if I do read it I now have less time to do other things. Third, if I want to find your post again to recommend it to someone or link to it, the chances of my doing so successfully are decreased. So, if your goal is science communication, or even just not being disrespectful of your readers’ time, full feeds are the way to go.
This all goes for journal tables of contents as well. As I’ve mentioned before, if the journal feed doesn’t include the abstracts and the full author line, it just costs the papers readers and the journal’s readers time, making the scientific process run more slowly than it could.
So, bloggers and journal editors, for your readers’ sake, for science’s sake, please turn on full feeds. It will only take you two minutes. It will save science hundreds of hours. It will probably be the most productive thing you do for science all week.
Michael Nielsen on the importance and value of Open Science
We are pretty excited about what modern technology can do for science and in particular the potential for increasingly rapid sharing of, and collaboration on, data and ideas. It’s the big picture that explains why we like to blog, tweet, publish data and code, and we’ve benefited greatly from others who do the same. So, when we saw this great talk by Michael Nielsen about Open Science, we just had to share.
(via, appropriately enough, @gvwilson and @TEDxWaterloo on Twitter)
Why computer labs should never be controlled by individual colleges/departments
Some time ago in academia we realized that it didn’t make sense for individual scientists or even entire departments to maintain their own high performance computing resources. Use of these resources by an individual is intensive, but sporadic, and maintenance of the resources is expensive [1] so the universities soon realized they were better off having centralized high performance computing centers so that computing resources were available when needed and the averaging effects of having large numbers of individuals using the same computers meant that the machines didn’t spend much time sitting idle. This was obviously a smart decision.
So, why haven’t universities been smart enough to centralize an even more valuable computational resource, their computer labs?
As any student of Software Carpentry will tell you, it is far more important to be able to program well than it is to have access to a really large high performance computing center. This means that the most important computational resource a university has is the classes that teach their students how to program, and the computer labs on which they rely.
At my university [2] all of the computer labs on campus are controlled by either individual departments or individual colleges. This means that if you want to teach a class in one of them you can’t request it as a room through the normal scheduling process; you have to ask the cognizant university fiefdom for permission. This wouldn’t be a huge issue, except that in my experience the answer is typically a resounding no. And it’s not a “no, we’re really sorry but the classroom is booked solid with our own classes,” it’s “no, that computer lab is ours, good luck” [3].
And this means that we end up wasting a lot of expensive university resources. For example, last year I taught in a computer lab “owned” by another college [4]. I taught in the second class slot of a four slot afternoon. In the slot before my class there was a class that used the room about four times during the semester (out of 48 class periods). There were no classes in the other two afternoon slots [5]. That means that classes were being taught in the lab only 27% of the time or 2% of the time if I hadn’t been granted an exception to use the lab [6].
Since computing skills are increasingly critical to many areas of science (and everything else for that matter) this territoriality with respect to computer labs means that they proliferate across campus. The departments/colleges of Computer Science, Engineering, Social Sciences, Natural Resources and Biology [7] all end up creating and maintaining their own computer labs, and those labs end up sitting empty (or being used by students to send email) most of the time. This is horrifyingly inefficient in an era where funds for higher education are increasingly hard to come by and where technology turns over at an ever increasing rate.
Which [8] brings me to the title of this post. The solution to this problem is for universities to stop allowing computer labs to be controlled by individual colleges/departments, in exactly the same way that most classrooms are not controlled by colleges/departments. Most universities have a central unit that schedules classrooms and classes are fit into the available spaces. There is of course a highly justified bias to putting classes in the buildings of the cognizant department, but large classes in particular may very well not be in the department’s building. It works this way because if it didn’t then the university would be wasting huge amounts of space having one or more lecture halls in every department, even if they were only needed a few hours a week.
The same issue applies to computer labs, only they are also packed full of expensive electronics. So please, universities, for the love of all that is good and right and simply fiscally sound in the world, start treating computer labs like what they are: really valuable and expensive classrooms.
—————————————————-
[1] Think of a single scientist who keeps 10 expensive computers, only uses them a total of 1-2 months per year, but when he does the 10 computers aren’t really enough so he has to wait a long time to finish the analysis.
[2] And I think the point I’m about to make is generally true; at least it has been at several other universities I’ve worked over the years.
[3] Or in some cases something more like “Frak you. You fraking biologists have no fraking right to teach anyone a fraking thing about fraking computers.” Needless to say, the individual in question wasn’t actually saying frak, but this is a family blog.
[4] As a result of a personal favor done for one administrator by another administrator.
[5] I know because I took advantage of this to hold my office hours in the computer lab following class.
[6] To be fair it should be noted that this and other computer labs are often used by students for doing homework (along with other less educationally oriented activities) when classes are not using the rooms, but in this case the classroom was a small part of a much larger lab and since I never witnessed the non-classroom portion of the lab being filled to capacity, the argument stands.
[7] etc., etc., etc.
[8] finally…