In his recent guest post on this blog, Neil Stewart lamented The Unfulfilled Promise of Aggregating Institutional Repository Content. Owen Stephens (@ostephens), in the context of his work with the CORE projects at the Open University, commented on that post about "technological and policy barriers to a 3rd party aggregating content from UK HE IRs" and has subsequently posted in more detail over on the CORE blog.
Not to put too fine a point on it, I think Owen has identified issues that are fundamental to the potential value of our repository infrastructure in UK HE, at least in terms of responsible 3rd parties building services on top of that infrastructure. Owen also asks in the title of his post "What does Google do?", for which the short answer is that it indexes (harvests) metadata and full text for (arguably) commercial re-use unless asked not to by robots.txt. This is not necessarily to suggest that Google is irresponsible; it may well be, but that is a rather bigger discussion!
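To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved harvester might check whether it has been asked not to fetch something. The robots.txt content, the "ExampleHarvester" user agent and the repository.example.org URLs are all hypothetical and invented for illustration; they are not taken from any real repository, and this is not a description of how Google or CORE actually work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary repository (an assumption for
# illustration only, not copied from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: ExampleHarvester
Disallow: /bitstream/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt does not ask user_agent to stay away from url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    full_text = "http://repository.example.org/bitstream/123/paper.pdf"
    for agent in ("ExampleHarvester", "Googlebot"):
        print(agent, may_fetch(SAMPLE_ROBOTS_TXT, agent, full_text))
```

The point of the sketch is that the default is permissive: a crawler that is not named, and whose target path is not disallowed, simply goes ahead.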
For CORE, by comparison, it has understandably been important to establish individual repository policy on re-use of metadata and full text content. Where such policies exist at all they are invariably designed to be human readable rather than machine readable, which is obviously not conducive to automated harvest, in spite of guidance being available on how to handle record, set and repository level rights statements in OAI-PMH from http://www.openarchives.org/OAI/2.0/guidelines-rights.htm.
To quote Owen's review of the policies listed in OpenDOAR: "Looking at the 'metadata' policy summaries that OpenDOAR has recorded for these 125 repositories the majority (57) say "Metadata re-use policy explicitly undefined" which seems to sometimes mean OpenDOAR doesn't have a record of a metadata re-use policy, and sometimes seems to mean that OpenDOAR knows that there is no explicit metadata re-use policy defined by the repository. Of the remaining repositories, for a large proportion (47) OpenDOAR records "Metadata re-use permitted for not-for-profit purposes", and for a further 18 "Commercial metadata re-use permitted"."
It might be suggested that machine-readability is actually secondary to policy that is potentially misconceived in the first place, or that at the very least has not been fully thought through and is fatally fragmented across the sector. Such policy is arguably the result of lip-service rather than being based on what actually happens in the real (virtual) world.
For my own part, in my institutional role, I was very, er, green (no pun intended) when I defined our repository policies back in 2008 using the OpenDOAR policy creation toolkit - http://www.opendoar.org/tools/en/policies.php - and to be frank I haven't really revisited them since. I suspect I'm not terribly unusual. To quote Owen once more, "the situation is even less clear for fulltext content than it is for metadata. OpenDOAR lists 54 repositories with the policy summary "Full data item policies explicitly undefined", but after that the next most common (29 repositories) policy summary (as recorded by OpenDOAR) is "Rights vary for the re-use of full data items" - more on this in a moment. OpenDOAR records "Re-use of full data items permitted for not-for-profit purposes" for a further 20 repositories, and then (of particular interest for CORE) 16 repositories as "Harvesting full data items by robots prohibited"."
The (reasonably unrestrictive) metadata and full-text policies I chose at Leeds Metropolitan University state that "the metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided the OAI Identifier or a link to the original metadata record are given" and "copies of full items generally can be reproduced, displayed or performed, and given to third parties in any format or medium for personal research or study, educational, or not-for-profit purposes without prior permission or charge". Even this, with the word "generally", implicitly recognises that different restrictions may apply to different items. To some extent that reflects the complexity of negotiating copyright for green OA, not to mention the other types of records that repositories may hold (e.g. our repository also comprises a collection of Open Educational Resources [OER] which are in fact licensed at the record level with a Creative Commons URI in dc:rights, as in this example - http://repository-intralibrary.leedsmet.ac.uk/IntraLibrary-OAI?verb=GetRecord&identifier=oai:com.intralibrary.leedsmet:2711&metadataPrefix=oai_dc).
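A record-level licence URI in dc:rights is at least something a harvester can read without a human in the loop. As a rough sketch, the Python below pulls any URI-valued dc:rights out of an OAI-PMH GetRecord response in oai_dc format. The sample response, its identifier and its licence URI are made up for illustration (they are not the content of the Leeds Met record linked above), and the licence_uris function is a hypothetical helper rather than part of any existing harvester.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and by unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, trimmed-down GetRecord response (an assumption for illustration).
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example open educational resource</dc:title>
          <dc:rights>http://creativecommons.org/licenses/by-nc-sa/2.0/uk/</dc:rights>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def licence_uris(oai_xml: str) -> dict[str, list[str]]:
    """Map each record's OAI identifier to the URIs found in its dc:rights."""
    root = ET.fromstring(oai_xml)
    result: dict[str, list[str]] = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        # Keep only dc:rights values that look like URIs; free-text rights
        # statements are skipped because a machine cannot act on them.
        uris = [
            el.text.strip()
            for el in record.iterfind(".//dc:rights", NS)
            if el.text and el.text.strip().startswith(("http://", "https://"))
        ]
        result[identifier] = uris
    return result


if __name__ == "__main__":
    print(licence_uris(SAMPLE_RESPONSE))
```

A record with no URI in dc:rights simply comes back with an empty list, which is roughly the position most of our green OA content is in today.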
Nor are my policies available in a machine readable form (which, as we've established, is typical across the sector), and I'm not actually sure how this could even be achieved without applying a standard license like Creative Commons.
Owen goes on to consider "What does Google do?". If you haven't already, it's certainly worth reading the post in full, but he concludes that "Google, Google Scholar, and other web search engines do not rely on the repository specific mechanisms to index their content, and do not take any notice of repository policies". Indeed, I think in common with many repository managers, I devote a lot of time and effort to SEO to ensure my repository is effectively indexed by Google et al and that full-text content can be discovered by these global search engines...which seems somewhat perverse when our own parochial mechanisms and fragmented institutional policies make it so difficult to build effective services of our own.