Thursday, 22 March 2012

Are your repository policies worth the HTML they are written in?

In Neil Stewart's recent guest post on this blog he lamented The Unfulfilled Promise of Aggregating Institutional Repository Content. In the context of his work with the CORE projects at the Open University, Owen Stephens (@ostephens) commented on that post about "technological and policy barriers to a 3rd party aggregating content from UK HE IRs" and has subsequently posted in more detail over on the CORE blog.

Not to put too fine a point on it, I think Owen has identified issues that are fundamental to the potential value of our repository infrastructure in UK HE, at least in terms of responsible 3rd parties building services on top of that infrastructure. Owen also asks in the title of his post "What does Google do?", for which the short answer is that it indexes (harvests) metadata and full text for (arguably) commercial re-use unless asked not to by robots.txt. This is not necessarily to suggest that Google is irresponsible; it may well be, but that is a rather bigger discussion!
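To make the "unless asked not to" part concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a full-text file. It uses Python's standard urllib.robotparser module; the robots.txt content, the "ExampleHarvester" user agent and the repository URLs are all invented for illustration and do not describe any real repository or what Google actually runs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary repository. The paths and the
# "ExampleHarvester" user agent are assumptions made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/

User-agent: ExampleHarvester
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    pdf = "https://repository.example.ac.uk/1234/1/paper.pdf"
    for agent in ("Googlebot", "ExampleHarvester"):
        print(agent, may_fetch(ROBOTS_TXT, agent, pdf))
```

The point of the sketch is that robots.txt only expresses "may this robot fetch this path"; it says nothing about what may be done with the content afterwards, which is what a re-use policy is trying to cover.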

For CORE, by comparison, it has understandably been important to establish individual repository policy on re-use of metadata and full text content. Where such policies exist at all they are invariably designed to be human readable rather than machine readable, which is obviously not conducive to automated harvest, in spite of guidance being available on how to handle record-, set- and repository-level rights statements in OAI-PMH.

To quote Owen's review of the policies listed in OpenDOAR: "Looking at the 'metadata' policy summaries that OpenDOAR has recorded for these 125 repositories the majority (57) say "Metadata re-use policy explicitly undefined" which seems to sometimes mean OpenDOAR doesn't have a record of a metadata re-use policy, and sometimes seems to mean that OpenDOAR knows that there is no explicit metadata re-use policy defined by the repository. Of the remaining repositories, for a large proportion (47) OpenDOAR records "Metadata re-use permitted for not-for-profit purposes", and for a further 18 "Commercial metadata re-use permitted"."

It might be suggested that machine-readability is actually secondary to policy that is potentially misconceived in the first place, or at least has not been fully thought through and is fatally fragmented across the sector, and that is arguably the result of lip-service rather than being based on what actually happens in the real (virtual) world.

For my own part, in my institutional role, I was very, er, green (no pun intended) when I defined our repository policies back in 2008 using the OpenDOAR policy creation toolkit, and to be frank I haven't really revisited them since. I suspect I'm not terribly unusual. To quote Owen once more, "the situation is even less clear for fulltext content than it is for metadata. OpenDOAR lists 54 repositories with the policy summary "Full data item policies explicitly undefined", but after that the next most common (29 repositories) policy summary (as recorded by OpenDOAR) is "Rights vary for the re-use of full data items" - more on this in a moment. OpenDOAR records "Re-use of full data items permitted for not-for-profit purposes" for a further 20 repositories, and then (of particular interest for CORE) 16 repositories as "Harvesting full data items by robots prohibited"."

The (reasonably unrestrictive) metadata and full-text policies I chose at Leeds Metropolitan University state that "the metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided the OAI Identifier or a link to the original metadata record are given" and "copies of full items generally can be reproduced, displayed or performed, and given to third parties in any format or medium for personal research or study, educational, or not-for-profit purposes without prior permission or charge". Even this, with the word "generally", implicitly recognises the fact that there may be different restrictions that apply to different items, which to some extent reflects the complexity of negotiating copyright for green OA, not to mention the other types of records that repositories may hold (e.g. our repository also comprises a collection of Open Educational Resources [OER], which are in fact licensed at the record level with a Creative Commons URI in dc:rights).

Nor are my policies available in a machine-readable form (which, as we've established, is typical across the sector) and I'm not actually sure how this could even be achieved without applying a standard licence like Creative Commons.
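For what it's worth, the record-level case is the one place where machine readability is fairly easy to picture: a harvester can simply look for a Creative Commons URI in dc:rights. The following is a rough sketch of that check, assuming simple Dublin Core (oai_dc) metadata. The sample record and the function name are invented for illustration, and it makes no attempt to interpret free-text rights statements, which is exactly where repository-level policies currently live.

```python
import xml.etree.ElementTree as ET

# Namespace of simple Dublin Core elements such as dc:rights.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented oai_dc record for illustration; not copied from a real repository.
SAMPLE_RECORD = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example Open Educational Resource</dc:title>
  <dc:rights>http://creativecommons.org/licenses/by-nc-sa/2.0/uk/</dc:rights>
  <dc:rights>Some free-text rights statement a machine cannot act on</dc:rights>
</oai_dc:dc>
"""

CC_PREFIXES = ("http://creativecommons.org/", "https://creativecommons.org/")


def creative_commons_uris(record_xml: str) -> list[str]:
    """Return every dc:rights value that is a Creative Commons URI."""
    root = ET.fromstring(record_xml)
    uris = []
    for element in root.iter(f"{{{DC_NS}}}rights"):
        value = (element.text or "").strip()
        if value.startswith(CC_PREFIXES):
            uris.append(value)
    return uris


if __name__ == "__main__":
    print(creative_commons_uris(SAMPLE_RECORD))
```

An empty list here does not mean "no rights reserved" or "all rights reserved"; it only means the record carries nothing a harvester can interpret automatically, which is the situation for most of the policies discussed above.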

Owen goes on to consider "What does Google do?". If you haven't already, it's certainly worth reading the post in full, but he concludes that "Google, Google Scholar, and other web search engines do not rely on the repository specific mechanisms to index their content, and do not take any notice of repository policies". Indeed, in common (I think) with many repository managers, I devote a lot of time and effort to SEO to ensure my repository is effectively indexed by Google et al and that full-text content can be discovered by these global search engines, which seems somewhat perverse when our own parochial mechanisms and fragmented institutional policies make it so difficult to build effective services of our own.


  1. Thanks for this thoughtful response Nick. I'm going to follow up in the next couple of weeks with a post describing what CORE is actually going to do about harvesting metadata and fulltext. While it's been important to do this investigation, as an aggregator we've got to make a decision about what we do, and we obviously want to balance the needs of the repository owners with our own aims and mission.

    I think the approach we are planning to take walks this line, but it is really important for us as a project to get feedback from the 'repository' community and to continue the discussion whatever decisions we make now for the purposes of our current project. I hope that this blog and other UKCoRR channels will provide a way of doing this.

  2. A side note: I tried to do some OAI-PMH harvesting recently across a group of repositories and it just wasn't worth it. More often than not resumption tokens didn't work, and the metadata often lacked any identifier to explain what the metadata was for, which made harvesting it somewhat pointless.

    In the end I went back to screen scraping the sites, as that was a lot more reliable.

    1. Hi Pat. I also tried to do some OAI-PMH harvesting (for one evening) a couple of years ago, and posted a very, very badly written blog post about it.

      In short, I found it difficult to count the number of full text items, as the same DC field was used for so many things.

    2. Pat, this echoes the experience we've had with CORE. I've already blogged on the issues of linking from the metadata record to the thing described (with a particular focus on 'fulltext'), but I haven't mentioned that resumption token issues are among the more common of the problems we've encountered when harvesting records.

  3. This is a very valid point.

    Like you, we tried to make our policy fairly realistic and open:
    "Anyone may access the metadata free of charge.
    The metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided the OAI Identifier and/or a link to the original metadata record are given." If anything, if I was reviewing it today I would remove the need to link back (it's probably unrealistic).

    IR records are just webpages, and anyone who has ever looked at a web log file will know there are thousands of web crawlers indexing and reusing the data (some of which may be nice enough to refer to your robots file; most will not).

    I'm also not convinced that we need authors' or publishers' permission to redistribute. When we share our Catalogue bibliographic records we do not ask each individual author/publisher if they mind us passing on a record describing their book, and I don't see any real difference with IRs.

    Information on the web will be reused, and this can only be a good thing for us.