The dominance of open source tools in Big Data

Most of the tools that are best suited for dealing with Big Data are open source. This provides the research community with a huge opportunity, because no investment in software licenses is needed. You just download the software and ‘get on with it’. The challenge, as became clear at the Eduserv symposium last week, is to find people with the right skills to apply these tools.

Without a doubt, Apache Hadoop is the most important open source project in this space. It is amazing to see how fast the Apache Hadoop ecosystem is growing and how everyone is trying to jump on the bandwagon. Start-up companies like Cloudera and Hortonworks have no trouble finding venture capitalists willing to invest large sums of money, nearly every major tech company offers a Hadoop product, and other internet companies that deal with big data are using it (secretly or not). At the Eduserv symposium, EMC CTO Rob Anderson focused on the implications big data has for storage and showed their Hadoop-based offering. Because the Apache licence allows you to use any Apache project in a closed-source implementation, EMC can sell its Hadoop distribution without needing to make that product open source.
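What makes Hadoop so widely applicable is the simplicity of the MapReduce programming model it implements. The sketch below is not Hadoop itself, just a toy, in-memory illustration of the map/shuffle/reduce phases, using word counting as the classic example; Hadoop runs the same pattern distributed over a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: each document independently emits (word, 1) pairs,
    much like a Hadoop mapper processing one input split."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Shuffle + reduce: group the pairs by key and sum the counts,
    like Hadoop reducers each receiving all values for a key."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)

docs = ["Big Data needs big tools", "open source tools for big data"]
counts = reduce_phase(map_phase(docs))
```

Because the map step has no shared state, the framework can scale it out across machines transparently, which is exactly what Hadoop adds on top of this model.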

The big data trend has big implications for the research community. Guy Coates of the Sanger Institute showed how the amount of data they manage is increasing rapidly. They expect this increase to continue, especially since the cost of human DNA sequencing is dropping dramatically: they expect it to fall to $1,000 for a full scan within two years (excluding storage!). His main challenge was not the actual storage of the data, but its management while researchers are analysing it. Sanger uses the open source tool iRODS, a community-driven project that originates from the Data Intensive Cyber Environments (DICE) research group at the University of North Carolina.

Another open source project that featured prominently at the Eduserv symposium was Apache CouchDB. Simon Metson of Bristol University explained how NoSQL is an enabler of big data: new database systems that abandon the traditional relational approach are better suited to these tasks. Open source projects like CouchDB, but also Apache Cassandra, are leading in this space. Simon highlighted that the community aspect of big data is very important: by engaging with the community that uses these tools to solve their big data problems, you can solve the hard problems. Something you may encounter once in a thousand times may already have been solved by someone else in the community who runs into it more often, and vice versa.
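To make the contrast with relational databases concrete: CouchDB stores schema-free JSON documents and queries them through views, built by running a map function over every document and collecting the emitted key/value rows. The sketch below mimics that view-building idea in plain Python (it is an illustration of the model, not CouchDB's actual engine); the equivalent real CouchDB map function would be written in JavaScript.

```python
# Schema-free documents, as CouchDB would store them (JSON objects).
docs = [
    {"_id": "a", "type": "paper", "year": 2011},
    {"_id": "b", "type": "dataset", "year": 2012},
    {"_id": "c", "type": "paper", "year": 2012},
]

def map_fn(doc, emit):
    # CouchDB equivalent (JavaScript):
    #   if (doc.type == 'paper') emit(doc.year, doc._id);
    if doc.get("type") == "paper":
        emit(doc["year"], doc["_id"])

def build_view(documents, map_function):
    """Run the map function over every document, collect emitted rows,
    and keep them sorted by key, as a CouchDB view index does."""
    rows = []
    for doc in documents:
        map_function(doc, lambda k, v: rows.append({"key": k, "value": v}))
    return sorted(rows, key=lambda r: r["key"])

view = build_view(docs, map_fn)
```

There is no table schema and no join: documents with different shapes can live side by side, and the view index is what makes queries fast.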

The closing keynote was given by Anthony D. Joseph, professor at the AMP Lab at UC Berkeley. He mentioned how Facebook started the Open Compute Project to share best practice in cluster design for big data centres. It is an interesting example of the old economic adage that you should commoditise your complement. Berkeley is collaborating on the Apache Incubator project Mesos, a scalable cluster manager that can dynamically share resources between multiple computing frameworks, such as Hadoop, Spark and MPI.
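Mesos's central idea is two-level scheduling: the cluster manager makes resource offers to registered frameworks, and each framework decides for itself whether to accept. The toy sketch below illustrates that offer/accept loop; the class names and the single-resource (CPU-only) model are simplifications for illustration, not Mesos's actual API.

```python
class Framework:
    """A computing framework (e.g. Hadoop, Spark, MPI) registered
    with the cluster manager."""

    def __init__(self, name, cpus_needed):
        self.name = name
        self.cpus_needed = cpus_needed

    def consider_offer(self, cpus_offered):
        # Second scheduling level: the framework, not the manager,
        # decides whether the offered resources suit its tasks.
        return cpus_offered >= self.cpus_needed

def allocate(total_cpus, frameworks):
    """First scheduling level: offer the remaining resources to each
    framework in turn and record which offers were accepted."""
    allocations = {}
    remaining = total_cpus
    for fw in frameworks:
        if fw.consider_offer(remaining):
            allocations[fw.name] = fw.cpus_needed
            remaining -= fw.cpus_needed
    return allocations

allocs = allocate(16, [Framework("hadoop", 10),
                       Framework("spark", 4),
                       Framework("mpi", 8)])
```

The split matters because each framework has its own notion of scheduling (data locality for Hadoop, gang scheduling for MPI), so the cluster manager only mediates resources and stays framework-agnostic.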

So the technology is there, or well on its way to being developed, and being open source, anyone can download it and start using it. Technology is not the problem of big data; the challenges lie in the cultural and organisational change needed to capitalise on it. People within and across organisations need to be willing to share their data and to think of new, intelligent and creative ways of making use of it. Two well-known examples mentioned were Google Flu Trends, a website that predicts flu epidemics based on what people search for, and a Twitter application created to detect and report earthquakes using people's tweets.
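The idea behind both examples is the same: a sudden spike in mentions of a topic, relative to its normal background rate, signals a real-world event. A deliberately naive sketch of such a spike detector is below; the keyword, baseline and threshold factor are made-up illustrative parameters, and real systems (like the earthquake detector mentioned) use far more sophisticated statistics and location data.

```python
def detect_spike(tweets, keyword, baseline, factor=3):
    """Flag an event when mentions of a keyword in the current window
    exceed `factor` times the normal baseline rate."""
    mentions = sum(1 for t in tweets if keyword in t.lower())
    return mentions, mentions > factor * baseline

# A window of recent tweets (made-up examples).
window = [
    "earthquake just now, everything is shaking",
    "did anyone else feel that earthquake?",
    "morning coffee",
    "EARTHQUAKE in the city centre",
    "strong earthquake, books fell off the shelf",
]
count, alarm = detect_spike(window, "earthquake", baseline=1)
```

The point of the example is that the data (tweets, search queries) already exists; the creativity lies in noticing it can answer a question it was never collected for.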

A final challenge that was widely recognised at the conference was the shortage of skilled people in the big data space. This is true both for the data scientists needed to analyse the data and for the people who can help curate it for the longer term, which is a completely different challenge for many HE institutions. In the spirit of open source, though, there are many resources freely available online for people who want to get started. And of course, if you want to get started with one of the open source projects mentioned, there are many ways to get involved.
