The amount of data being generated is still increasing rapidly, and both the commercial and the academic sectors are working to tackle the new challenges that arise from it. These are exciting times for open source projects like Apache Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. Many big IT players like Microsoft, IBM, Oracle and Amazon use Hadoop in their offerings.
Academic researchers also continue to generate bigger and bigger data sets. This creates challenges not only for processing the data (something Hadoop can help with) but also for managing it. Managing research data involves aspects like version management and longer-term curation, to make sure the data sets are and will remain available, just like the scientific publications that were based on them.
One exciting project that OSS Watch is currently involved with is DataFlow. This is a project that is tackling the issue of research data management in two stages.
Firstly, there is a software tool called DataStage. In a way this tool works similarly to the popular tool Dropbox: researchers save files to a dedicated location on a network drive, which means the files are stored on a departmental server and version-managed automatically. A new version is created whenever a file is changed and saved onto the drive, so the researcher can always go back to a previous version of the file if necessary.
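To make the idea of automatic version management a little more concrete, here is a minimal sketch in Python. It is not DataStage's actual implementation: the `VersionedStore` class, its methods and the on-disk layout (one numbered copy per version) are hypothetical names and choices made purely for illustration. It only shows the general principle that every changed save produces a new version, while earlier versions stay retrievable.

```python
import tempfile
from pathlib import Path


class VersionedStore:
    """Hypothetical toy store that keeps every distinct saved version of a file."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def versions(self, name):
        """Return the stored version files for `name`, oldest first."""
        directory = self.root / name
        if not directory.exists():
            return []
        return sorted(directory.iterdir(), key=lambda p: int(p.name))

    def save(self, source):
        """Store the current content of `source`; return its version number.

        A new version is only created when the content differs from the
        most recent stored version.
        """
        source = Path(source)
        data = source.read_bytes()
        existing = self.versions(source.name)
        if existing and existing[-1].read_bytes() == data:
            return len(existing)  # unchanged, so no new version
        directory = self.root / source.name
        directory.mkdir(exist_ok=True)
        number = len(existing) + 1
        (directory / str(number)).write_bytes(data)
        return number

    def read(self, name, version):
        """Return the bytes of a given version (1 is the oldest)."""
        return (self.root / name / str(version)).read_bytes()


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    store = VersionedStore(workdir / "store")

    results = workdir / "results.csv"
    results.write_bytes(b"sample,value\nA,1\n")
    store.save(results)

    results.write_bytes(b"sample,value\nA,2\n")
    store.save(results)

    # The researcher can still go back to the first version.
    print(store.read("results.csv", 1).decode())
```

A real system would also need to handle concurrent writers, deletions and storage limits; the sketch leaves all of that out.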
The second stage of data curation begins when a file or a set of files is finally used for a publication and the researcher wants to make the data set available to other researchers, or wants to include a DOI reference to it. The researcher can then copy the files over to DataBank, an institution-level research data repository.
Both DataStage and DataBank are open source software projects, so we welcome potential users and developers to try them out. The projects are released under the permissive MIT licence, which makes it possible for commercial companies to include the software in a proprietary offering.
Many universities are looking for a solution for research data management, and we believe that the tools DataFlow is developing are very useful for meeting that requirement. Join us on our mailing list and find out more about this project!