Tuesday, 21 December 2010

OAI-PMH aggregation?

I wanted to explore the issue of OAI-PMH aggregation and to gauge UKCoRR opinion of its still-to-be-realised potential (or not). I've also been threatening to post to the blog for a while and this seemed like an ideal subject to explore in a more public forum than on the mailing list.

As I once noted in a post on my own blog, I have for some time been a little nonplussed by our collective, continued obsession with the woefully under-used OAI-PMH. Other than OAIster (an international service), the only services I'm currently aware of in the UK are the former Intute demo now maintained by Mimas - http://irs.mimas.ac.uk/demonstrator/ - and a (pilot) OAI-PMH cross-search tool developed as part of ERIS (Enhancing Institutional Repository Services in Scotland).

The protocol dates back to the earliest days of the open access and institutional repository movements, when there was considerable investment by the community (in software specification, for example), and I don't think it has ever been as widely used as it could be. I can offer only anecdotal evidence but I'm pretty sure that your average academic will tend towards Google/Google Scholar* - who withdrew support for OAI-PMH back in April 2008 - to source research on the open web. Google, however, arguably has inherent limitations for academic purposes and I would argue that OAI-PMH still has considerable potential for (OA) research dissemination (though possibly watered down by so many repositories also carrying metadata-only records rather than exclusively full text - one of the drawbacks of OAI-PMH harvest is that there is no easy way of filtering on full text from the major repository software).
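To make the filtering problem concrete, here is a rough sketch of the sort of guesswork a harvester is reduced to. It assumes a ListRecords response in the standard oai_dc format and treats a record as "full text" if it declares a dc:format of application/pdf or carries a dc:identifier ending in .pdf. That rule is purely my own assumption for illustration, as are the sample records and the repo.example.ac.uk addresses; real repositories expose their files in different ways, so a heuristic like this will miss some full text and wrongly include some records.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response: one record with a PDF, one without.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.ac.uk:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Paper with an attached file</dc:title>
          <dc:format>application/pdf</dc:format>
          <dc:identifier>http://repo.example.ac.uk/1/1/paper.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repo.example.ac.uk:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Metadata-only record</dc:title>
          <dc:identifier>http://repo.example.ac.uk/2/</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def looks_like_full_text(record):
    """Heuristic (an assumption, not part of the protocol): PDF format or .pdf link."""
    formats = [(e.text or "").strip().lower() for e in record.iterfind(".//dc:format", NS)]
    links = [(e.text or "").strip().lower() for e in record.iterfind(".//dc:identifier", NS)]
    return "application/pdf" in formats or any(link.endswith(".pdf") for link in links)


def full_text_identifiers(xml_text):
    """Return the OAI identifiers of records that pass the heuristic."""
    root = ET.fromstring(xml_text)
    return [
        record.findtext("oai:header/oai:identifier", namespaces=NS)
        for record in root.iterfind(".//oai:record", NS)
        if looks_like_full_text(record)
    ]


print(full_text_identifiers(SAMPLE_RESPONSE))
```

The point is less the code than how fragile it is: nothing in the protocol itself says whether a record has a file behind it, so the harvester has to infer it from whatever each repository happens to put in its Dublin Core.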

* As an aside I've had mixed results retrieving full text records from UK IRs using Google Scholar with many not returning anything at all - though the IRs in question certainly contain full text content.

In Ireland, however, they have rian.ie - Pathways to Irish Research, which is a much more fully realised portal that aggregates 8 Irish IRs using OAI-PMH, enabling you to browse by author surname and offering an advanced search form to filter by keyword, title, author, subject, institution and (interestingly) funder. Aggregating just 8 repositories (5 DSpace, 2 EPrints, 1 Digital Commons) will obviously make it easier to standardise metadata and systems than in the UK, and it also returns full text only, which immediately makes it more useful from an OA perspective. I've been in touch with the chair of the RIAN project group who has confirmed that "it was a policy decision to include only full text metadata in the RIAN harvest, even though some IRs might have some metadata only deposits. It was felt that a national portal of OA research material would be much more useful if it included only full text." This was achieved, however, by organising local IRs such that only full text content is exposed for harvest, which isn't really a practical solution across the much greater number of repositories in the UK.

I've had some discussion with James Toon, the project manager for ERIS, who in his dealings with research groups in Scotland has found "no interest at all in just searching for data in a national aggregation". Nevertheless, I can't help but feel that there is still potential for an aggregation service with a high level of functionality, especially if we could figure out how to return full text only. Maybe it's just me?

James suggests that "the power of aggregations are on the subject level, when you can do things with the data - such as enhance it by linking common ontology, or providing subject specific services, such as topic mapping and so on"; he has also been working on CRISpool, which is using the CERIF standard to integrate heterogeneous research information from several institutions into a single portal. Perhaps OAI-PMH has had its day and CERIF-XML aggregation is the future. Nevertheless, the current repository infrastructure across the UK does not (yet) widely support that format, though this may change if more institutions implement CRIS. The older protocol, by contrast, is a standard output across all 183 institutional repositories in the UK currently listed on OpenDOAR and, for that reason if no other, I would argue it could be used more effectively, after the rian.ie model.
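For anyone who hasn't looked at the protocol itself, part of the appeal is how little a harvester needs to know about each repository. The sketch below builds the ListRecords request URLs a harvester would issue against a set of repositories: a first request naming the metadata format, then follow-up requests carrying only the resumptionToken the repository handed back. The verb and argument names come from the OAI-PMH specification; the base URLs and the function name are made up for illustration, and actually fetching and paging through the responses is left out.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for one repository.

    When a resumption token is supplied it is sent on its own, because the
    protocol treats resumptionToken as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical OAI-PMH base URLs standing in for real repositories.
REPOSITORIES = [
    "http://eprints.example.ac.uk/cgi/oai2",
    "http://dspace.example.ac.uk/oai/request",
]

for base in REPOSITORIES:
    print(list_records_url(base))
```

The same two-line request works whether the repository is EPrints, DSpace or anything else that implements the protocol, which is really the whole argument for building on it.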


  1. One of the reasons OAI-PMH never took off, as far as I can see, is the lack of a business model for the service provider end. OAIster itself has struggled along by finding people with money as needed and I'm guessing that rian.ie is funded publicly. Neither generates income, though, unless you count the indirect income that accrues from investment through awareness. But it has been hard to justify the costs of service providers in these terms.

    There are clearly other technical solutions now, of which linked data holds much promise. However, the same business model dilemma applies, and it will be interesting to see whether linked data services manage to overcome this barrier and succeed where OAI-PMH hasn't. Providing tools that allow low-level entry may be one approach (OAI service provider set-up was fine, but not simple).

    Having said that, I tend to agree that OAI-PMH may still have its place for focused, targeted services where the business model works. EThOS is one example where OAI-PMH is proving useful in supporting the service's development, and a great deal of thought and effort went into developing the business model for that before it got going.


  2. Interesting post - a couple of quick thoughts:

    I would agree that OAI-PMH has a very strong feature - it is currently pretty much ubiquitous in the field. The difficulty of identifying metadata records which point to a full text artefact is not really a weakness of the protocol as such. I think that harvesters and aggregations will need to be able to adopt other protocols in addition to OAI-PMH, rather than instead of it.

    I agree with James when he suggests that it's the potential for enhancements to the data which is interesting. I also completely agree with his assertion that the community has no interest in going to a central aggregation just to search for records. It's been my contention for some time, however, that such a central aggregation could be an important component in a wider system. An aggregation of metadata is useful as a starting point - however harvested - as it solves the network latency problem. As more data is exposed elsewhere, the potential for applying data mashing techniques to enhance the aggregation becomes interesting.

    For all the rhetoric about Linked Data offering a more distributed data model for the Web, the emerging model seems closer to the existing one underpinning the Web of documents, with the widespread use of a few concentrations of data, enhanced by smaller, localised datasets for local requirements. If we view an aggregation of metadata records as open infrastructure, perhaps we can encourage others to apply the enhancements.

    UKOLN is working on this kind of aggregation and I would be very interested in hearing suggestions for the sorts of data enhancements which would be useful or interesting to readers.