Archive for August, 2007

Social Networking for Open Source Programmers

Ohloh is an open source network that connects people through the software they create and use. I’ve been monitoring it for some time and have been impressed with the direction it is going. It is gradually turning into a kind of social networking tool for open source programmers. Unlike the likes of MySpace and Facebook, it extracts relationships and activity from the commit logs of open source projects, so there is no need for people to manually maintain their relationships with others.

However, we should recognise that the approach of using version control commit logs is very limited. It does not recognise the contributions of users who report bugs and request features, help clarify documentation and perform a great many other useful activities which do not show up in commit logs. It also misses people who contribute to documentation (unless it is stored in version control). Similarly, those who participate in design discussions on the mailing list are not credited. Finally, it does not recognise activities such as evangelism and community development.

Even when we recognise these limitations, and only focus on programmer activity, there are problems. Raw logs only indicate the number of commits, not the value of those commits. For example, someone running a script to format the code will be seen to have made a major contribution, but in fact they have not added any functional value to the project. Similarly, the user who submits a patch fixing a really sticky bug will not be spotted by Ohloh, since the commit log will credit a committer, not the contributor, with the activity. Furthermore, Ohloh has no way of knowing it was a complicated bug they squashed.
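To make the point concrete, here is a minimal sketch of what a raw tally over a commit log looks like. It is not how Ohloh actually computes anything; the record layout, the names and the numbers are all invented for illustration.

```python
from collections import Counter

# Hypothetical commit records: (committer, lines_changed, description).
# Every name and number here is invented for illustration only.
commits = [
    ("alice", 12000, "run code formatter over whole tree"),
    ("bob", 8, "apply patch from carol: fix deadlock in scheduler"),
    ("bob", 40, "add unit tests"),
]


def activity_by_committer(records):
    """Tally commits and changed lines per committer, as a raw log would."""
    commit_counts = Counter()
    line_counts = Counter()
    for committer, lines_changed, _description in records:
        commit_counts[committer] += 1
        line_counts[committer] += lines_changed
    return commit_counts, line_counts


commit_counts, line_counts = activity_by_committer(commits)
print(commit_counts.most_common())
print(line_counts.most_common())
```

In this made-up log the formatting run dominates the line count, while carol, who wrote the eight-line deadlock fix, does not appear in either tally because the log only records who committed.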

It is possible for users to indicate their non-commit activity on a project, and a Kudos system allows others to acknowledge the value of their peers’ contributions. But these details have to be manually maintained.

When considering these limitations, it should be recognised that Ohloh is quite young, but developing fast. Furthermore, it is about to publish an API that will allow other projects to extend its functionality. For example, I’m currently working on a proposal that will allow social networking profiles to be extracted from publicly archived mailing lists. Hopefully their API will enable me to feed this data back into Ohloh via the Simal project.

Even with its current limitations Ohloh is interesting; it may even be useful. I’ll be watching it with interest.

Ubuntu tries to get community in the US

The Ubuntu project is running a major campaign to get “LoCos” up and running. These Local Community teams are centred on a geographical location, rather than on a piece of software as most Ubuntu teams are. The Ubuntu vision for LoCos is:

[E]nthusiasts and fans around the world have collect[ing] together in garages, universities and pubs to talk about their interest, learn from each other and help promote their interest.

This is very similar to the generic Linux User Groups (LUGs), which have been widespread for the last 10-15 years, and to the Unix user groups before them. Ubuntu are co-opting this model, and using the LoCos as a focus for:

  • Distributing free install CDs and other promotional material
  • Install fests, release parties and other gatherings
  • Speaking bookings for Ubuntu speakers
  • Localisation of Ubuntu software, documentation and websites
  • Promotion of Ubuntu locally

Ironically, while Ubuntu appears to have had little trouble getting LoCos set up in the developing world, where the focus on localisation is very strong, they have been struggling to get LoCos up and running in the US. This is probably partly down to the existing entrenched user groups and partly down to the lesser demand for the resources that a LoCo can supply. The current campaign aims to get LoCos running in every state in the US and they’re doing well, with 39 state teams in the start-up phase.

Many of the LoCos seem to have been set up by existing Ubuntu developers in direct competition for developer mindshare with existing local user groups. It will be interesting to see whether this community seeding attempt is successful in the medium to long term.

If you’re interested in building community around your open source project, OSS Watch can help.

The licence doesn’t matter

Many newcomers to open source get caught up in licence discussions. However, when it comes to adoption of open source software in larger organisations and governments “rarely if ever do licensing questions come up.” At least this is a claim recently made by Dave Rosenberg (CEO of MuleSource) in his Q&A with himself, in which he discusses MuleSource’s decision to release under the newly approved CPAL licence.

Dave also says:

Open source is thriving in big companies and governments. I can’t even believe the uptake that is going on.

So, if the licence does not matter to these organisations, why are there so many OSI approved licences? Surely just one Open Source licence would do the trick?

I think the licence does matter, at least for some users. Dave does acknowledge this when discussing MuleSource’s adoption of CPAL as opposed to the GPLv3:

There are several reasons [for choosing CPAL rather than GPLv3]. First of all, we’re not convinced that there is enough clarity about the way our software works (typical deployments have Mule touching 2 or more other applications via many different methods like JMS, web services etc.) to be able to accurately explain how derivative works are created. There are also a host of other wacky Java/integration aspects that are not totally clear. Under no circumstance do we want to stifle adoption of the product or upset the user community.

Matt Asay, an advisor to MuleSource, observes:

I find this fascinating. In some projects, derivative works are fairly straightforward. Not in an ESB (Enterprise Service Bus) like MuleSource. To ensure maximum community contribution, therefore, MuleSource is bending over backward to ensure its customers and community have an easy-to-grok license.

The licence may not be a major consideration for organisations who intend to be users, as opposed to contributors. However, to be sustainable an open source project ought to be encouraging and enabling all users to become contributors.

Different licences not only differentiate between the options available to users and developers, but they also differentiate between the type of community that can be developed around an open source product. MuleSource recognise this and have chosen the licence most appropriate to their product and sustainability plans.

So, what licence are you going to choose? OSS Watch offer free consultations to UK HE and FE projects to help navigate this, and other, open source minefields.

Get more done with community led projects

How do you get people to work on your code without them knowing it?

At OSS Watch we are building an RDF based project catalogue, called Simal. The initial work was based on some work I did, with David Reid, over at Apache for their project catalogue. My contributions to this work built on Apache Forrest, although the Apache projects site eventually opted to use a Perl based transformation system, so we now only share XSL stylesheets.

As part of the Simal project I have had to make significant improvements to Forrest. Of course, all this work has been donated back to Forrest and is now available to other Forrest users. But this post is not about the outward benefit; it is about the inward benefit Simal gains from engagement with open communities.

Much of this improvement work has been to add plugin support for a new alpha feature in Forrest called the dispatcher. The dispatcher has been around in Forrest for some time and is just now starting to realise its full potential. Simal uses the dispatcher in ways it has never been used before. For example, Simal now includes features such as an Ajax browser based on Exhibit from MIT and an RSS feed reader using the Google AJAX Feed API.

Because most of these projects (Google AJAX Feed API excepted) are open source I’ve been able to produce a reasonably functional catalogue in very little active developer time: OSS Watch is devoting a mere half a day a week to this project.

But what happens when you hit a bug in another project’s code? Simple, fix it and apply a patch. But what if you can’t fix it? What if your knowledge of the project is insufficient?

Well, that is exactly what happened recently. I hit a weird bug that prevented my Ajax features from working correctly in certain circumstances. I was lost: I did not know what the problem was, or even where to start looking. After narrowing down the problem as tightly as I could I posted a mail to the Forrest dev list asking for pointers. I tried to describe the problem in as much detail as I could. Around 45 minutes later Thorsten Scherler announced he was going to take a look at it for me.

Here we see the true benefit of an open, community led project. Others are willing to help, often with pointers and ideas, sometimes with actual development effort.

Why did Thorsten want to expend his energy on my problem? I can only guess at his personal reasons, but my first guess would be that because he is the original author of the majority of the dispatcher code and he recognises that I am a power user of his code, he considers me a valuable user. By supporting me, he will ensure that I continue to work with his code, to help identify and iron out bugs and to continue to enhance it when I hit its limits.

Furthermore, and most importantly, by telling me what the problem is and trusting me to review his commits as he works on it, he ensures I will learn how the deep innards of the dispatcher work. This then means that in the future I can assist other users who hit problems with it. More satisfied users means a more sustainable project.

Around one hour later he reported that he had reproduced the problem and identified two potential locations where it may be rooted. One was in code I am not familiar with; the other was in code I know well. Since this news came at the end of the day I decided not to debug, but to go to bed.

The next day I found another community member had made a suggestion that would help narrow the problem, so now there were three people working on it. I debugged the code I knew and found it was not the source of the problem. Since I was due to go on holiday I notified the community of my findings and went off to enjoy a long sunny weekend in a field listening to music.

When I returned Thorsten had committed what he thought was a fix. Fantastic, people solving my problems while I’m on holiday. People, that is, who are not in my team and are not directly related to Simal.

Unfortunately, the bug was still present, but Thorsten’s patch indicated the area where he thought the problem lay. This information enabled me to perform further debugging work. Whilst I didn’t get to the root of the problem I was able to define and document a workaround in the Forrest issue tracker. Users who hit the problem after me can now work around it until a fix is put in place.

I suspect the information in our discussions and the provision of my workaround will be enough for Thorsten to find the real cause of the problem, and so the next release of Forrest will have one bug fewer.

This is an example of community development at its best. My thanks go out to all the wonderfully talented people working on Simal, even though most of them don’t even know they are contributing. At a rough count there are about fifteen of them, not a bad return on an investment of half a day a week.

Depositing documents in repositories: Which repositories should we use?

Recently during a discussion with Pete Cliff from RSP, the question arose “which repository, if any, should OSS Watch be putting our documents in?” The possible answers were:

  1. don’t put them anywhere
  2. put them in an institutional repository
  3. put them in a subject-specific repository
  4. put them in a funder-specific repository
  5. put them in a Creative Commons repository
  6. some combination of the above

I’ll confess straight off that I’m enough of a bibliophile and library-lover that the first option doesn’t appeal. Besides, our entire remit at OSS Watch is dissemination based, and it makes no sense to hide our outputs under a bushel. Repositories are to open access what version control systems are to open source; it’s very hard to argue against their use except where copyright is unclear, which is generally a sign of larger problems.

Institutional repositories are something I’m less than confident about.

Recently I completed my PhD in Computer Science at the University of Waikato, and ingest of my thesis into the institutionally supported “Australasian Digital Theses Program” repository was a seamless part of the submission and degree granting process. It just worked; the only slight wrinkle was that I wrote in LaTeX rather than Word, as the documented submission process assumed. I have confidence that in a hundred years a copy of my thesis will still be in existence (I have less confidence it will be of interest, but never mind).

Once I was awarded my degree I also deposited the thesis with my employer’s institutional repository (Oxford Eprints), primarily to increase the chance that future Research Assessment Exercise-type activities would also “just work.” Several months after depositing with Oxford Eprints, I got an email from OULS, saying that they were migrating from Oxford Eprints to Oxford University Research Archive, and that my deposit did not meet the collections policy for the new archive, so it would be dropped. Presumably the new archive either isn’t going to be used for RAE-type work, or PhD theses are not considered research. Dropping works from an archive seems like a crazy policy to me, since I always understood the difference between a library and an archive to be that an archive doesn’t drop deposits after ingest.

Thus I have very mixed feelings about institutional repositories.

Subject-specific repositories can also be very effective. arXiv.org is the canonical example here. It’s a huge repository of research papers (mostly preprints) in the physical and mathematical sciences, built to overcome the problem that in hot topics research moves faster than journals can be published on paper. arXiv.org has been around for more than ten years and remains well-supported, well-trusted and well-used. The problem I have with subject-specific repositories is that they are insular: they further entrench the two cultures and act as barriers to communication.

I strongly believe that the currently well-funded physical sciences would benefit from a little cross-subsidising of the infrastructure of currently less well-funded fields. Look at Middle-Eastern studies in western universities before and after 11 September 2001. Literally overnight (at least for those of us down under at the time), this became a field of huge popular, political and academic interest. Shared infrastructure enables hot topics to scale up rapidly. Let’s not forget, either, that during the western European dark ages (when Galileo Galilei was on the Index Librorum Prohibitorum) it was the Middle-East that kept alive the foundations of those currently well-funded physical sciences, to the extent that the field of algebra is named after the work of Muhammad ibn Mūsā al-Khwārizmī, who worked in Baghdad, and the text of “On Divisions of Figures” by Euclid is known only by re-translation from the Arabic.

There has been recent discussion of JISC setting up a repository for outputs of work they fund. This would allow them to ensure that materials they fund can be found and promoted as appropriate. It would also make it much easier to enforce the rules about making outputs available to UK higher and further education. For the fundees, a repository would enable them to quickly and easily provide evidence of the project outputs. In the medium to long term I have my doubts about such a repository, however. While I have no doubt that Oxford University and the field of physics will be around in 100 years, I’m not confident that the JISC will be. Unless it’s clear where the repository will be in 100 years, I’m a little hesitant. Having it hosted by a research institution which is paid up-front to preserve its contents in perpetuity would be adequate. The JISC already funds Jorum, a repository of learning content, but my understanding is that unlike research outputs, learning content is not expected to be archived in perpetuity: it’s a library rather than an archive.

There are a number of archives using Creative Commons licences (and one or two others) as a political message. These have a particular resonance with the open archives, open source and open content movements, in that they actively work to promote content reuse and thoughtful use of copyright, and to build communities around shared content. While I have sympathies with these political messages, the archives are typically of uncertain long-term sustainability. In a hundred years the political message is unlikely to be relevant, and while some may adapt and evolve, many are likely to wither and vanish. If we’re lucky some will get absorbed into institutions which already have long-term sustainability.

Placing a document in multiple repositories has benefits. You potentially get the advantages of each of the repositories at the cost of redepositing the document. By creating two sets of metadata describing a document during deposit, there is the potential to later use that metadata for cross-walking the repositories (this only works, of course, if the document has a unique id across the repositories). But by diluting the unique holdings of a repository, you make it a less attractive target for preservation funding or to be absorbed into larger repositories. Automated ingest from one repository to another or deposit into two repositories with the same metadata schema doesn’t work either, since really what you’re building is a mapping from one metadata schema to another based on instances which have been human-classified.
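As a rough sketch of what cross-walking from shared instances might look like, the snippet below guesses which fields correspond by comparing records for documents that both repositories hold. The repositories, identifiers, field names and values are all invented, and real metadata would need far more careful matching than simple equality.

```python
# Two hypothetical repositories describing documents with different
# metadata schemas. All identifiers, fields and values are invented.
repo_a = {
    "doc-001": {"title": "Open source licensing", "creator": "A. Author"},
    "doc-002": {"title": "Community building", "creator": "B. Writer"},
}
repo_b = {
    "doc-001": {"dc:title": "Open source licensing", "dc:creator": "A. Author"},
    "doc-003": {"dc:title": "Unrelated report", "dc:creator": "C. Other"},
}


def infer_crosswalk(records_a, records_b):
    """Guess field correspondences from documents held in both repositories.

    A field in schema A is mapped to a field in schema B only if the two
    carry the same value in every shared document that has the A field.
    """
    shared_ids = records_a.keys() & records_b.keys()
    candidates = {}
    for doc_id in shared_ids:
        for field_a, value_a in records_a[doc_id].items():
            matches = {fb for fb, vb in records_b[doc_id].items() if vb == value_a}
            if field_a in candidates:
                candidates[field_a] &= matches
            else:
                candidates[field_a] = matches
    return {fa: sorted(fbs) for fa, fbs in candidates.items() if fbs}


crosswalk = infer_crosswalk(repo_a, repo_b)
print(crosswalk)
```

Note that everything hinges on "doc-001" carrying the same identifier in both repositories; without a shared unique id there is nothing to align the two descriptions on.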

There are a number of technical and consortia-based approaches which could solve many of these problems. The US, UK or EU science funding bodies may decide that explicit direct funding for repositories with scope to match that of research libraries is beneficial in the short to medium term. Automated OAI-PMH techniques may evolve to the point where researchers expect full local mirrors of all significant archives globally (consider the UK mirror of arXiv.org). The large science publishers may realise that the oncoming light is not the end of the tunnel. Copyright law may get completely rewritten. But none of these are directly relevant to the question of which repository we should be putting our documents in right now.

Ideas anyone?