Tuesday 6 March 2012

The Unfulfilled Promise of Aggregating Institutional Repository Content (Guest Post)

Our thanks to Neil Stewart, Digital Repository Manager at City University London for the following guest post which raises some interesting questions for us all.
A very good question was posed on Stephen Curry’s blog by Björn Brembs recently (Curry and Brembs are a couple of the more prominent figures supporting the Elsevier boycott):
I’ve always wondered why the institutional repositories aren’t working with, e.g. PubMed etc. to make sure a link to their version is displayed with the search results. I mean, how difficult can this be?
This got me thinking, how difficult can it be? Aggregating and re-using institutional repository (IR) content at subject level is, after all, one of the promises of the Green road to Open Access.
The infrastructure is already in place, in the form of the many OAI-PMH compliant institutional repositories out there, and there is also the SWORD protocol, which allows flexible transfer of repository content. Some examples do exist: the Economists Online service, for instance, harvests material from selected economics research-intensive universities, then makes it available via a portal. But (to my knowledge) there has been no work done to provide a way of, e.g., ensuring all of a repository’s eligible physics content is automatically uploaded to ArXiv, or all biomedical research to UKPMC.
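To give a feel for what the harvesting side of this involves, here is a minimal Python sketch of an OAI-PMH `ListRecords` request and response parser. It is an illustration under assumptions, not a description of any existing service: the base URL and the `physics` set name are hypothetical, and real repositories name their sets however they like (if they expose subject sets at all).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint.

    base_url and set_spec are supplied by the caller; the values used in
    any example call are hypothetical.
    """
    if resumption_token:
        # A resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        records.append({
            # The OAI identifier from the record header.
            "oai_id": rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier"),
            "titles": [e.text for e in rec.iter(f"{{{DC_NS}}}title")],
            "subjects": [e.text for e in rec.iter(f"{{{DC_NS}}}subject")],
            # dc:identifier usually carries the link back to the IR copy.
            "links": [e.text for e in rec.iter(f"{{{DC_NS}}}identifier")],
        })
    # An absent or empty token means the list is complete.
    token = root.findtext(f".//{{{OAI_NS}}}resumptionToken")
    return records, (token or None)
```

A harvester would fetch `list_records_url("https://repository.example.ac.uk/oai", set_spec="physics")`, pass the response body to `parse_records`, and keep requesting with the returned token until it comes back as `None`. The `links` field is what a subject repository would need if it only wanted metadata pointing back to IR holdings.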
Subject repositories have gained critical mass in certain disciplines (to add to the examples above, see also RePEc for economics, SSRN for social sciences and DBLP for computer science), meaning that if a paper doesn’t appear there, it’s far less visible. This means that the incentive to post locally is greatly reduced: yes, your paper will appear in Google, but a paper in ArXiv will appear both in Google and in the native interface of the repository where everyone else in your discipline is depositing.
So if the infrastructure is there and the rationale to create these links exists, why has it not been happening to any meaningful extent already? I suspect it’s because the IR landscape is, by its very nature, a fragmented one. Those with responsibility for IRs (managers, IT people, and senior management) are understandably concerned with local issues: ensuring that IRs are properly managed and integrated with the university’s systems, as well as the usual open access and service awareness-raising and advocacy. Having time to think about the automatic population of ArXiv with papers from your home repository is probably pretty far down one’s to-do list.
That’s not to say that repository managers are oblivious to these issues, but here another problem arises. Few individual repository managers, I would guess, would think that they individually could negotiate with and persuade ArXiv that automatic harvesting of physics content from their repository, and their repository alone, would be worth ArXiv’s while. This is, perhaps, where UKCoRR (or other national bodies, JISC perhaps?) might come in. If ArXiv or similar subject repositories could be persuaded of the merits of harvesting IR content (whether full text or metadata pointing back to IR holdings), it would allow all repositories to plug in to this system, and offer it as a service to academics (two-for-one deposit: local IR and ArXiv at the same time!).
So, what do people think? Is there any appetite for turning this into a project that UKCoRR members could take forward, perhaps with UKCoRR and/or JISC oversight? Comments please!


  1. I'm currently working with the CORE project at the Open University (core-project.kmi.open.ac.uk), which is building an aggregation of content from UK HE repositories. CORE is looking to offer services to the repository community (notably a 'similarity' measure between papers based on machine analysis of each paper's semantic content), and also search platforms that search across the harvested metadata and full-text content.

    I think there are a number of questions that can be asked about aggregation and which are relevant to the question of why we haven't seen more of this type of activity in the sector.

    1) does aggregating UK HE repository content create a meaningful set of content to search? I think the subject based repositories mentioned definitely have this going for them, but I'm not 100% convinced that people start by thinking "I want to look for stuff in UK HE repositories".

    2) does aggregation work from a technology perspective? That is - purely from a technical point of view how easy is it to harvest and aggregate content?

    3) does aggregation work from a policy perspective? That is - do IRs have policies in place that allow aggregation, or make it simple?

    CORE may have something to say about all of these, and perhaps particularly (2) and (3). I'm afraid that my work so far suggests that there are both technological and policy barriers to a 3rd party aggregating content from UK HE IRs.

    For example, many IRs have policies on harvesting and reuse of metadata and full-text content, but rarely (never?) in a machine readable format, and in some cases restrictive to the point where the metadata/content would not strictly be allowed to feature in an aggregation.

    I'm worried that issues around (2) and (3) serve to stifle innovation in this area. What should be easy becomes difficult, and the outcome is that we don't see good aggregations being built to test the underlying question posed by (1) - what (if anything) makes a compelling aggregation of IR content.

    However, I don't want to be overly pessimistic - CORE, and other projects, are starting to push at these issues, and there is interest from the community as this post illustrates. I hope shortly to say something about CORE and harvesting of content on the CORE blog, and by the end of the project we should be able to offer some guidance to the sector on what things/approaches enable aggregation.

  2. This issue of re-use rights is a really difficult one IMO. Even if I'd like to make the entire contents of the repository I manage CC-BY (including for text mining etc.) or similar, I can't do so because of the potential concerns and rights of publishers and authors, which may themselves differ. I'm sometimes tempted just to try it and see what happens, but again it's not much use if just my IR is licensed in this way! The problem of fragmentation again...

  3. Björn Brembs has issued a veritable call to arms for librarians and IR managers here: http://occamstypewriter.org/scurry/2012/02/21/an-open-letter-on-open-access-to-uk-research-councils/#comment-8750 (he had some problems posting direct to the UKCoRR blog apparently)

  4. I'm actually thinking about this same problem right now in reverse. The problem that IR managers typically have is getting researchers to deposit in their home IRs at all. Researchers are more likely to deposit their work in a repository that is recognized by their discipline. It seems to me that IRs should be using SWORD and other similar programs to harvest data from subject repositories like ArXiv to help populate their local IRs so that researchers aren't having to duplicate their submissions.