The dominance of open source tools in Big Data

Most of the tools that are best suited for dealing with Big Data are open source. This provides the research community with a huge opportunity, because no investment in software licenses is needed. You just download the software and ‘get on with it’. The challenge, as became clear at the Eduserv symposium last week, is to find people with the right skills to apply these tools.

Without a doubt, Apache Hadoop is the most important open source project in this space. It is amazing to see how fast the Apache Hadoop ecosystem is growing and how everyone is trying to jump on the bandwagon. Start-up companies like Cloudera and Hortonworks have no trouble finding venture capitalists willing to invest large sums of money. Similarly, nearly every major tech company is offering it, while other internet companies that deal with big data are using it (secretly or not). At the Eduserv symposium, EMC CTO Rob Anderson focused on the implications big data has for storage, and showed their Hadoop-based offering. Because the Apache licence allows you to use any Apache project in a closed-source implementation, EMC can sell their Hadoop distribution without needing to make that product open source.

The big data trend has big implications for the research community. Guy Coates of the Sanger Institute showed how the amount of data they are managing is increasing rapidly. They expect this increase to continue, especially since the cost of human DNA sequencing is dropping dramatically. They expect it to drop to $1000 for a full scan within two years (excluding storage!). His main challenge was not the actual storage of the data, but the management of the data while researchers were analysing it. Sanger is using the open source tool iRODS, a community-driven project that originates from the Data Intensive Cyber Environments (DICE) research group in the DICE Center at the University of North Carolina.

Another open source project that featured prominently at the Eduserv symposium was Apache CouchDB. Simon Metson of Bristol University explained how NoSQL is the enabler of big data, and how new database systems that do not use the traditional relational database approach are better suited for these tasks. Open source software projects like CouchDB, but also Apache Cassandra, are leading in this space. Simon highlighted that the community aspect of big data is very important. By engaging with the community that uses these tools to solve their big data problems, you can solve the hard problems. Something you may encounter once in a thousand times may already have been solved by someone else in the community who runs into it more often, and vice versa.

The closing keynote was given by Anthony D Joseph, professor at the AMP Lab at UC Berkeley. He mentioned how Facebook started the Open Compute project to share best practice in cluster design for big data centers. It is an interesting example of the old economic adage that you should commoditise your complement. Berkeley is collaborating on the Apache Incubator project Mesos, which is a scalable cluster manager that can dynamically share resources between multiple computing frameworks. They support frameworks like Hadoop, Spark and MPI.

So the technology is either there or well on its way. And because it is open source, anyone can download it and start using it. Technology is not the problem with big data; the challenges lie in the cultural and organisational change that is needed to capitalise on it. People within and across organisations need to be willing to share their data and think of new, intelligent and creative ways of making use of this data. Two well-known examples that were mentioned were Google Flu Trends, a website that predicts flu epidemics based on what people search for, and a Twitter application that was created to detect and report on earthquakes using people’s tweets.

A final challenge that was recognised widely at the conference was the shortage of skilled people in the big data space. This is true both for the data scientists that are needed to analyse the data, and for people that can help curate the data in the longer term, which is a completely different challenge for many HE institutions. In the spirit of open source though, there are many resources freely available online for people who want to get started, such as on the website bigdatauniversity.com. And of course, if you want to get started with one of the open source projects mentioned, there are many ways to get involved.

What makes a community led project work?

This guest post has been contributed by Ross Gardler of OpenDirective. Ross is Vice President of Community Development at The Apache Software Foundation and a mentor at the Outercurve Foundation. Ross has been active in open development of open source software for over ten years.

OSS Watch has been participating in the development of Apache Rave, a ‘next-generation portal engine, supporting (Open)Social Gadgets as well as W3C widgets’. As Sander observes in this blog, the Rave ecosystem is made up of a ‘diverse range of collaborators’ from both the academic and commercial sectors. These partners are sharing resources in order to build a critical piece of software at lower cost as well as to increase innovation around that product.

A few days ago I posted an evaluation of the Apache OpenOffice project’s journey through the Apache Incubator (all code entering the Apache Software Foundation (ASF) must pass through the incubator). That post looked at what makes an Apache project different from many other open source projects. This post repeats many of the same points, but rather than examine them from the point of view of OpenOffice I will examine why the predominantly academic team behind Apache Rave chose to go to the ASF.


FRAND or FOSS?

Standards in technology are generally considered to be a good thing. Having documented technologies that can be implemented by all means that businesses can compete on equal terms and consumers benefit from the effects of this competition. Of course, before a technology can be standardised, individual technology players need to do the work of innovation to develop the techniques the standard will encompass. Sometimes these technology players will have sought to protect their investment in innovation by obtaining a patent for the innovative technology they have created. Patents are designed to provide a monopoly over a specific technological process for the owner, so how does this monopoly fit in with the idea of a standard?


Don’t keep your data under your desk

It is a well-known problem for researchers. Data is being collected for a research project and no decision has been made about how to manage the data during the project. Naturally, once you have finalised the project and start publishing the end results, you may deposit your final dataset in an institutional repository such as your university’s DSpace or EPrints repository, or you may even put it in Dryad. However, that is not sufficient to keep your data safe while you are still working on it. Often, such data ends up on a computer that just happens to lie around in the office or department, or even on the researcher’s local machine.

Why Open?

This question was raised to me recently, and comes up frequently. It’s complicated by the fact that the word ‘open’ means many things to many people, but there are threads of commonality through all of the varying definitions. So the question is: “Why is openness useful to the public sector?” There are many answers to this, but here I’d like to concentrate on one that is perhaps less frequently cited.

In 2003, early in OSS Watch’s history, Sebastian Rahtz and Stuart Yeates drafted a policy on open source software for our funders the JISC, beginning it during a long train journey to an event. JISC had been receiving questions from the community about its attitude to open source, which was becoming something of a hot topic. I had joined OSS Watch at its inception, having worked in other externally funded projects here in Oxford before that. One thing that had become clear quickly was that intellectual property rights were often an afterthought among projects, and that particularly where project work involved collaboration between institutions, failing to sort out those rights early could result in hair-tearing complications by project end. Where the problem was not solved, project outputs could remain undistributed, and the public money invested in them locked away. Of course JISC was even more aware of this than any individual institution. Thus the open source policy served the dual purposes of spelling out the benefits of open licensing of resources and introducing the idea that intellectual property rights needed to be dealt with early in a project’s lifecycle.

The policy introduced a presumption that software developed with JISC resources would be open source. While this might seem like a value judgement about openness, the fact that projects could make an argument against openness where they felt it would be detrimental was another key component. In practice projects could take either approach, but what they could not do was ignore the issue. The openness presumption provided a default exploitation model that would allow maximum reusability of the publicly funded resources. If the project’s host institution felt that a different licensing model would suit the work better, then that option was open to them. All they needed to do was to justify it.

So one use of openness for publicly funded works is – I would argue – to stimulate creative thinking about exploitation. If the default assumption is that the intellectual property will be ‘in the cupboard’ and ready for exploitation when we get around to it, it is all too easy to postpone the decision. Operational complications can then mean it is forgotten altogether. If we begin with a default policy of openness, we know that this cannot happen, and the option to draft variant exploitation models means that we do not limit anyone’s creative thinking.

JISC were ahead of the curve in identifying the root problem here and implementing the policy to deal with it. As we have worked with other public funders over the years it has been extremely useful to point to the policy and the thinking behind it.

Graduating Apache Rave project demonstrates open innovation in software

The Apache Rave project graduated from the Incubator last month. This means that the Rave project has shown itself to be a viable project community, governed well according to the meritocratic principles of the Apache Software Foundation.

Apache Rave provides a next-generation portal engine, supporting (Open)Social Gadgets as well as W3C widgets. Have a go with the latest release and you will see that it works out-of-the-box, but it can alternatively serve as the basis for an enterprise-level social portal application.


Meeting at the Junction: cross-sector collaboration seeded at OSS Watch workshop

Last week saw the third edition of Open Source Junction: two days of presentations and interactive sessions with representation from both the commercial and the academic sectors. It was a successful workshop with a lot of interesting interaction, and new ideas for collaboration were discussed.

The report of the workshop will be published shortly. For now, please have a read through the live blogs of days one and two below and check the slides of the sessions for more information.

Open sourcing software essential for reproducible science

I was very pleased last week to read that Nature published an editorial that argued for open sourcing software that had been used in the research leading up to a publication.

The basic principle is very simple: in order for claimed scientific results to be credible, it must be possible to verify those results. The key to doing that properly is the ability to reproduce these results. And if some piece of software code used to create these results is not made available to the scientific community, it is not possible for the wider community to reproduce the results.

Various initiatives have been taken to ensure this academic principle is followed. For example, a conference like SIGMOD appoints a repeatability committee that will need the software used for the creation of accepted papers. Although this is good to ensure SIGMOD’s papers have been thoroughly checked, it will not enable other researchers in the field to verify the results or to build on them.

Luckily, many scientists, such as Daniel Lemire, see open sourcing their code as normal practice in their research. The software project provides a solid basis for collaboration, and as such is an example of open innovation in academic communities.

One of the barriers that is cited against open sourcing code is that the university may see commercial value in it and wish to commercialise it. It is important to realise, however, that there are many business models available for institutions that go far beyond just selling the raw outputs of a software project. All of these still allow the institutions to create a viable business by adding value to what is available on the download page of a code hosting website. The true value of a software project is never just in the raw code.

At OSS Watch, we work with academic projects that develop software as part of their research and provide free support to the UK academic sector. So if you have a question about how to open source your code or how to deal with licences, we welcome you to contact us.

Upcoming JISC OSS Watch Webinars

This is just a quick plug for a webinar that I will be running – with the kind assistance of JISC – next Wednesday (7th March) on the topic: “Choosing the right open source licence”. To quote the blurb:

There are many free and open source software licences, and while they all broadly attempt to facilitate the same things, they also have some differences. Some of the major differences can be grouped together into categories, and this talk acts as an introduction to these categories. Having attended this session, you should be able to understand which decisions you should take in order to select a licence for your code.

Delegates will take away an understanding of:

  • the main categories of open source licences available
  • the implications of choosing one for the future of your software

Also, advance notice that the week after, on Wednesday March 14th, OSS Watch’s Sander Van Der Waal will be asking: “How healthy is your open source community?”

To be viable, academic projects using open source software need to ensure that people continue to engage with their project beyond initial funding. Similarly, academic institutions and businesses seeking to adopt open source solutions developed as part of academic projects need to be sure they can do so without exposing themselves to unmanageable risk. By using the Software Sustainability Maturity Model, both businesses and academic users and developers can identify any weak points in their development and governance processes, and address them as appropriate. This session will provide the participants with the skills to assess the non-technical aspects of open source software development.

Having attended this session, you will be able to answer the question: “Can a business collaboration be built around this open source project?” You will understand how to evaluate the health of an open source community and plan for sustainable engagements with companies interested in developing products or services based on it.

I hope you can join us!

Open source “matches proprietary code quality”

Sometimes we are asked to give an opinion on a particular piece of open source software and its quality in comparison to a specific closed source alternative. Of course, with the sheer number of projects and products out there, it is often very hard to answer these kinds of questions with any authority, and this means that we often cannot give a detailed answer. On one occasion where I was personally asked this kind of question, I gave the usual disclaimer and set about asking what contacts I had in that specific problem domain what their opinion was (for my own edification as much as that of the questioner). One particular response I got back was interesting; I’ll paraphrase as the communication was not intended to be public. In essence the respondent – someone with many years’ experience in this particular area – told me that they had heard good things about the open source implementation but that in their opinion only an idiot would ever use it for ‘real world tasks’. It stood to reason, they argued, that open source must necessarily be buggier and less professional than closed source, and notwithstanding anything they heard to the contrary about the quality of this particular solution, they could not recommend anyone waste their time with it.