Don’t keep your data under your desk

It is a well-known problem for researchers: data is being collected for a research project, but no decision has been made about how to manage it during the project. Naturally, once you have finalised the project and start publishing the end results, you may deposit your final dataset in an institutional repository such as your university’s DSpace or EPrints repository, or you may even put it in Dryad. However, that does not keep your data safe while you are still working on it. Often, such data ends up on a computer that just happens to be lying around in the office or department, or even on the researcher’s local machine.

People who are conscious of back-up issues may be using a service like Dropbox, SkyDrive or Google Drive, but questions around data ownership and rights may make you reluctant to use these services.

So what would be easier than just saving your data in a folder, as you would with tools like Dropbox, but having it backed up by the institution, version-controlled automatically and kept within the trusted boundary of your organisation, while still allowing you to share the folders with your research group or a wider group of people, whichever is appropriate?

This is what the open source tool DataStage offers you. Developed as part of the DataFlow project, it is a piece of software that is usually installed at the departmental level of your institution, but it can also be hosted in a virtual ‘cloud’ infrastructure. As a user, you simply map a network drive to it. You save files as normal, and everything else is handled for you. Near the end of the project, when you start publishing and want to make the datasets available to a wider public, you can push any dataset to a SWORD-compliant repository, such as the ones mentioned above, or to a DataBank instance.
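DataStage handles the push to a repository for you, so you never need to write this yourself. Still, for the curious, here is a rough sketch in Python of what a deposit to a SWORD-compliant repository might look like under the hood. It assumes the SWORD v2 conventions (a zip package sent by HTTP POST with a Packaging header); the collection URL, file names and helper functions are made up for illustration and are not part of DataStage. The request is only built, not sent, and a real deposit would also need authentication.

```python
import io
import urllib.request
import zipfile

# SWORD v2 packaging identifier for a plain zip of files.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def zip_dataset(files):
    """Pack a {filename: bytes} mapping into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def build_deposit_request(collection_url, package, filename):
    """Build (but do not send) an HTTP POST that deposits a zip package."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": SIMPLE_ZIP,
        # "false" tells the repository that the deposit is complete.
        "In-Progress": "false",
    }
    return urllib.request.Request(
        collection_url, data=package, headers=headers, method="POST"
    )


# Hypothetical collection URL and file contents, for illustration only.
package = zip_dataset({"readings.csv": b"sample,value\na,1\n"})
request = build_deposit_request(
    "https://repository.example.org/sword/collection", package, "dataset.zip"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual deposit, provided the repository accepts it and you have supplied credentials.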

The beauty of an open source project like DataStage is that anyone is welcome to use the software and contribute towards its ongoing development. You can imagine there are many more use cases for a tool like this that are unrelated to research data. Take for example the popular Raspberry Pi project. In a classroom where all the kids have their own little computer, they can submit their homework via DataStage to the teacher, who can check and mark everything centrally on the main server. This smart alternative application was highlighted by David Shotton in his presentation during the DataFlow Launch Workshop on 2 March.

Are you curious about what DataStage can do for you? Come and download our beta release to try it out and join us on the DataFlow mailing list to tell us about your experiences and what may be improved. We would love to hear from you!