Depositing documents in repositories: Which repositories should we use?

Recently, during a discussion with Pete Cliff from RSP, the question arose: “Which repository, if any, should OSS Watch be putting our documents in?” The possible answers were:

  1. don’t put them anywhere
  2. put them in an institutional repository
  3. put them in a subject-specific repository
  4. put them in a funder-specific repository
  5. put them in a Creative Commons repository
  6. some combination of the above

I’ll confess straight off that I’m enough of a bibliophile and library-lover that the first option doesn’t appeal. Besides, our entire remit at OSS Watch is dissemination-based, and it makes no sense to hide our outputs under a bushel. Repositories are to open access what version control systems are to open source; it’s very hard to argue against their use except where copyright is unclear, which is generally a sign of larger problems.

Institutional repositories are something I’m less than confident about.

Recently I completed my PhD in Computer Science at the University of Waikato, and ingest of my thesis into the institutionally supported “Australasian Digital Theses Program” repository was a seamless part of the submission and degree-granting process. It just worked; the only slight wrinkle was that I wrote in LaTeX rather than in Word, which the documented submission process assumed. I have confidence that in a hundred years a copy of my thesis will still be in existence (I have less confidence it will be of interest, but never mind).

Once I was awarded my degree I also deposited the thesis with my employer’s institutional repository (Oxford Eprints), primarily to increase the chance that future Research Assessment Exercise-type activities would also “just work.” Several months after depositing with Oxford Eprints, I got an email from OULS saying that they were migrating from Oxford Eprints to Oxford University Research Archive, and that my deposit did not meet the collections policy for the new archive, so it would be dropped. Presumably the new archive either isn’t going to be used for RAE-type work, or PhD theses are not considered research. Dropping works from an archive seems like a crazy policy to me, since I always understood the difference between a library and an archive to be that an archive doesn’t drop deposits after ingest.

Thus I have very mixed feelings about institutional repositories.

Subject-specific repositories can also be very effective. arXiv.org is the canonical example here. It’s a huge repository of papers (largely preprints) in the physical and mathematical sciences, built to overcome the problem that in hot topics research moves faster than journals can be published on paper. arXiv.org has been around for more than ten years and remains well-supported, well-trusted and well-used. The problem I have with subject-specific repositories is that they are insular: they further entrench the two cultures and act as barriers to communication.

I strongly believe that the currently well-funded physical sciences would benefit from a little cross-subsidising of the infrastructure of currently less well-funded fields. Look at Middle-Eastern studies in western universities before and after 11 September 2001. Literally overnight (at least for those of us down under at the time), this became a field of huge popular, political and academic interest. Shared infrastructure enables hot topics to scale up rapidly. Let’s not forget, either, that during the western European dark ages (when Galileo Galilei was on the Index Librorum Prohibitorum) it was the Middle-East that kept alive the foundations of those currently well-funded physical sciences, to the extent that the field of Algebra is named after the work of Muhammad ibn Mūsā al-Khwārizmī, who worked in Baghdad, and the text of “On Divisions of Figures” by Euclid is known only by re-translation from the Arabic.

There has been recent discussion of JISC setting up a repository for outputs of work they fund. This would allow them to ensure that materials they fund can be found and promoted as appropriate. It would also make it much easier to enforce the rules about making outputs available to UK higher and further education. For the fundees, a repository would enable them to quickly and easily provide evidence of the project outputs. In the medium to long term I have my doubts about such a repository, however. While I have no doubt that Oxford University and the field of physics will be around in 100 years, I’m not confident that the JISC will be. Unless it’s clear where the repository will be in 100 years, I’m a little hesitant. Having it hosted by a research institution which is paid up-front to preserve the deposits in perpetuity would be adequate. The JISC already funds Jorum, a repository of learning content, but my understanding is that unlike research outputs, learning content is not expected to be archived in perpetuity: it’s a library rather than an archive.

There are a number of archives using Creative Commons licences (and one or two others) as a political message. These have a particular resonance with the open archives, open source and open content movements, in that they actively work to promote content reuse and thoughtful use of copyright, and to build communities around shared content. While I have sympathies with these political messages, the archives are typically of uncertain long-term sustainability. In a hundred years the political message is unlikely to be relevant, and while some may adapt and evolve, many are likely to wither and vanish. If we’re lucky some will get absorbed into institutions which already have long-term sustainability.

Placing a document in multiple repositories has benefits. You potentially get the advantages of each of the repositories at the cost of redepositing the document. By creating two sets of metadata describing a document during deposit, there is the potential to later use that metadata for cross-walking the repositories (this only works, of course, if the document has a unique id across the repositories). But by diluting the unique holdings of a repository, you make it a less attractive target for preservation funding or to be absorbed into larger repositories. Automated ingest from one repository to another or deposit into two repositories with the same metadata schema doesn’t work either, since really what you’re building is a mapping from one metadata schema to another based on instances which have been human-classified.
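To make the cross-walking idea concrete, here is a minimal Python sketch. It assumes two repositories whose records share a document id, and it guesses which field in one schema corresponds to which field in the other by counting how often human-entered values agree. The function names (`normalise`, `infer_crosswalk`), the field names and the sample records are all hypothetical and not taken from any real repository.

```python
from collections import Counter, defaultdict


def normalise(value):
    """Compare values loosely: ignore case and surrounding whitespace."""
    return str(value).strip().casefold()


def infer_crosswalk(records_a, records_b):
    """Guess a field mapping from schema A to schema B.

    Each argument maps a shared document id to a dict of metadata fields
    as entered by a human at deposit time. Only ids present in both
    repositories contribute evidence.
    """
    votes = defaultdict(Counter)
    for doc_id in records_a.keys() & records_b.keys():
        for field_a, value_a in records_a[doc_id].items():
            for field_b, value_b in records_b[doc_id].items():
                if normalise(value_a) == normalise(value_b):
                    votes[field_a][field_b] += 1
    # For each field in A, keep the field in B that matched most often.
    return {field_a: counter.most_common(1)[0][0]
            for field_a, counter in votes.items()}


# Hypothetical deposits of the same two documents in two repositories.
repo_a = {
    "doc-1": {"title": "Open Source in Education", "creator": "A. Author", "date": "2008"},
    "doc-2": {"title": "Repositories and Reuse", "creator": "B. Writer", "date": "2007"},
}
repo_b = {
    "doc-1": {"dc_title": "open source in education", "dc_creator": "A. Author", "issued": "2008"},
    "doc-2": {"dc_title": "Repositories and Reuse ", "dc_creator": "B. Writer", "issued": "2007"},
    "doc-3": {"dc_title": "Only Deposited Here", "dc_creator": "C. Other", "issued": "2006"},
}

crosswalk = infer_crosswalk(repo_a, repo_b)
```

With no shared ids the function has nothing to compare and returns an empty mapping, which is the point made above: the approach depends entirely on a unique id that both repositories agree on.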

There are a number of technical and consortia-based approaches which could solve many of these problems. The US, UK or EU science funding bodies may decide that explicit direct funding for repositories with scope to match that of research libraries is beneficial in the short to medium term. Automated OAI-PMH techniques may evolve to the point where researchers expect full local mirrors of all significant archives globally (consider the UK mirror of arXiv.org). The large science publishers may realise that the on-coming light is not the end of the tunnel. Copyright law may get completely rewritten. But none of these are directly relevant to the question of which repository we should be putting our documents in right now.

Ideas anyone?

3 thoughts on “Depositing documents in repositories: Which repositories should we use?”

  1. Stevan Harnad

    Where should research (preprints, published articles, theses, data) be deposited?

    In the researcher’s own Institutional Repository (IR) first (other repositories too, if desired, for redundancy). Oxford made a serious mistake in not migrating your deposit, but these are early days. Universities will come to understand their responsibilities once Open Access IR content grows, worldwide, and with it the understanding of what OA IRs are about, and for. Rational, coherent IR policies and practices will soon evolve. Your thesis deposit was alas one of the early casualties. (Fortunately, redeposit is only a few keystrokes away, and both you and your text are still alive and well!)

    The primary function of OA IRs is to provide OA for immediate, ongoing research use. Their secondary function is preservation (and this will become central once IRs take over the access-provision and preservation function from journal publishers and libraries).

    Preservation means a systematic, continuous, global redundancy and migration policy. All achievable, but what’s needed first is the OA content. That will drive the preservation policy, not vice versa.

    Model University Self-Archiving Policy

    Optimizing OA Self-Archiving Mandates: What? Where? When? Why? How?

    Stevan Harnad
    American Scientist Open Access Forum

  2. Ross Gardler

    Is it sufficient to say “in the researcher’s own Institutional Repository”? I note from Stevan Harnad’s post on “Optimizing OA Self-Archiving Mandates” (see previous comment) that the emphasis is on OAI-compliant repositories, which makes sense. However, don’t we also need to ensure that people are aware of the repository and can actually find the data within?

    By submitting to multiple repositories, aren’t we both maximising the reachability of the content and helping to make preservation more likely (through redundancy)?

    Why limit ourselves to a single institutional repository?
