Monday 16 July 2012

UKCoRR blog has moved

Please note that the UKCoRR blog is now part of our main website - http://ukcorr.org/activity/blog/


This blog will not be updated; all content will remain here but has also been migrated to the new site.


Friday 11 May 2012

EThOS Update!

Guest Post by EThOS Service Manager, Sara Gould:

I’ve been following the recent posts here with interest. The open access discussions are fundamental to EThOS, of course, although theses generally make up only a small portion of your repository content and are unlikely to be anywhere near the most challenging content to manage.


In many ways EThOS is in a privileged position: it simply needs to reflect your own policies and practice in making thesis metadata and full-text content open to the world. If you want it in EThOS, we’ll do what we can to get it in there. We’re so close to 60,000 theses and 300,000 records now – watch out for that mini celebration.


Interestingly OpenDOAR considers EThOS out of scope because of its requirement for users to log in to access full-text theses. True, the login process is a bit of a deterrent but it does provide some reassurance for authors that we could track users if we ever needed to, and it does give us a chance to look at user demographics.


We’re about to send out a summary of usage stats to all member institutions, and here’s an example. This is a JISC Band C member institution that we’ve been harvesting from for some months now. I’m watching the balance between clickthroughs to the repository-held copy and downloads from EThOS with interest. We might expect clickthroughs to overtake quickly, especially as the proportion of harvested content that includes a link URL, relative to older digitised content held in EThOS, rises steadily.







Date   | Theses Harvested | Digitisation Requests | Records Created | Referrals to Organisation Repositories | Theses Downloaded
Sep-11 |  40 |  0 |  1 |  0 |  58
Oct-11 |  15 |  1 |  0 |  0 |  63
Nov-11 |  55 |  3 |  1 |  0 |  58
Dec-11 |  55 |  3 |  1 | 14 |  38
Jan-12 |   0 |  1 |  7 | 20 |  58
Feb-12 |   0 |  5 |  0 | 25 |  83
Mar-12 |  55 |  7 |  1 | 30 |  67
Total  | 220 | 20 | 11 | 89 | 425

This institution also supports digitisation of its own older theses: 20 in-demand theses digitised in the last 6 months is not bad at all. I love this part of the EThOS service – creating a critical mass of digitised theses was one of its original aims and it’s still a really neat function.

Harvesting and interoperability – the subject of Nick Shepherd’s post here – have been a little more challenging. But we’re getting there. Last month we harvested 2,600 theses from 33 institutions. Within the BL, we’ve transferred the metadata harvest over to the metadata experts – it seems logical – and Heather Rosie will be in touch with everyone waiting to be harvested over the next couple of months. She’s also overseeing the upgrade of records by the cataloguing team and trying to keep EThOS and the BL Primo catalogue consistent in their display of EThOS content. She’s desperate to eliminate the many duplicate records on EThOS, and we have a plan for that too.


What about flows of records and theses in the other direction? Heather’s responding to requests from resource discovery services to share the metadata, and we’re expecting Primo Central to announce that EThOS data is available via their services any day now. And a reminder that the rather clunky EThOS Download Tool can be used to pull back your own digitised theses from EThOS. Contact Customer Services for more info on that.
But what we all want is full OAI-PMH interoperability. The tech guys at the BL are aiming to crack that challenge soon, so that everyone will be able to harvest the metadata easily without intervention from us. We ask you to be OAI-PMH compliant so that we can harvest from you, so it seems only fair that we do what we can in the other direction.


Finally, a quick trailer for our EThOS workshop at Open Repositories 2012 in July. Hope to see you in Edinburgh.

Sara Gould
11/5/12

Thursday 3 May 2012

Jimmy Wales to advise government on open access to research

Interesting press coverage on 2nd May 2012 that Jimmy Wales is to help the government ensure that all publicly funded research is available freely online within two years. David Willetts made this announcement in a speech delivered to the Publishers Association on the evening of 2nd May.

Reading the OA lists, there’s a range of opinion on his involvement, from scepticism to warm welcome. He’s certainly high profile and has long been a proponent of open access – remember the 24-hour closure of Wikipedia in protest at the proposed SOPA and PIPA legislation in the US. Celebrity involvement does guarantee that the press will take note.

On balance, UKCoRR believes that we should welcome the plans put forward today, but there is a need to ensure that the government also listens to those who have been working in UK academia to promote and extend open access. It would seem more sensible to build on the resources, experience, projects and infrastructure that already exist than simply to start from scratch.

According to the Guardian,

“This initiative is most likely to result in a central repository that will host all research articles that result from public funding. The aim is that, even if an academic publishes their work in a traditional subscription journal, a version of their article would simultaneously appear on the freely available repository. The repository would also have built-in tools to share, comment and discuss articles.”

There is a dearth of detail about implementation at the moment – understandably, as the group convened by Dame Janet Finch won’t be reporting until June 2012. But it seems likely that the “central repository” won’t be a physical thing but will build on current infrastructure and projects. There has to be a pivotal role for Repository Junction, which is “a standalone middleware tool for handling the deposit of research articles from a provider to multiple repositories”, thus avoiding the thorny problem of duplicate deposit, which is understandably disliked by academics. See http://edina.ac.uk/cgi-bin/news.cgi?filename=2012-04-24-rjbroker-ori.txt

However, the bigger stumbling block is that old chestnut, copyright. Simultaneous traditional publication and open access repository availability of publicly funded research is restricted by publishers’ policies, which can change in an instant. We repository workers all know the minefield that is journal copyright policies and the care our host organisations take to avoid breaching them. The solution is to replace the practice where the author signs away their copyright with one where they give the publisher a non-exclusive licence to publish the article. Let’s get that in the two-year plan and we’d really be making progress!

The text of David Willetts’ speech was published this morning and it makes interesting reading. He makes much reference to the gold road to open access but, given the context, this is not surprising. We might take issue with his definition of green: “Green means publishers are required to make research openly accessible within an agreed embargo period” but this is a minister telling the publishing industry that open access is here to stay. “Our starting point is very simple. The Coalition is committed to the principle of public access to publicly-funded research results. That is where both technology and contemporary culture are taking us. It is how we can maximise the value and impact generated by our excellent research base. As taxpayers put their money towards intellectual enquiry, they cannot be barred from then accessing it”.

Right at the end of the speech, there was a reference to the REF which indicates that open access is being considered for inclusion in the criteria for assessment: “HEFCE is also considering the issue. Peer review and assessment of impact are crucial to their allocation of research funding. The debate on open access will inform HEFCE's planning for the research excellence process that succeeds the current one which concludes in 2014. Open access could be among the excellence criteria for qualifying articles in the future”. This is really exciting stuff – it would really change academics’ practice and behaviour. Let’s keep this on the agenda.

All in all, a good day for open access.

Tuesday 3 April 2012

UKCoRR Responds to RCUK's Revised Policy on Open Access

As members will be aware, over the past couple of weeks we have been collating feedback and comments from you all on RCUK's revised policy on open access. Many, many thanks to those members and Committee who have contributed to the drafting of UKCoRR's response, which I can confirm has just been submitted to their communications officer. In the spirit of the openness that underlies everything UKCoRR does, the text of the communication follows. Naturally we'll share with you any and all responses the Committee receives.
----
UKCoRR, on behalf of its membership, wishes to respond to the RCUK’s proposed policy on access to research outputs, given that it is an area that directly impacts on the activities of our 250+ members across the UK’s research establishments.

At the outset, UKCoRR would like to commend RCUK for a very clear, positive and explicit statement, in particular the restriction of support for embargoes longer than 6 months post-publication (12 for AHRC- and ESRC-funded work). UKCoRR is also delighted to note the tone of the policy in favouring more Open Access-friendly journals and publishers for RCUK-funded research outputs.

There are, however, several issues, concerns and clarifications that our membership would like us to flag up for your attention and consideration.

Definition of Open Access
UKCoRR would query the definition of Open Access to scholarly publications as digital journal articles: "The Research Councils define Open Access to mean unrestricted, on-line access to peer reviewed and published scholarly research papers." [p2], and likewise your classification of books or monographs as “grey literature” [p3]. There are many other types of authoritative research output besides PDFs of journal articles, especially in arts and humanities research. It is our belief that the OA movement needs to lobby for access to all research outputs, and that there is a need to ensure that RCUK’s policy does not exclude this significant portion of the scholarly literature. The contents of our members’ Open Access repositories clearly demonstrate the richness, value and readership of such scholarly works.

Creative Commons
The CC licence is a potentially contentious issue [p2] and one that UKCoRR believes may create more problems than it solves.

The current model of Green OA (i.e. self-archiving in an IR), as opposed to Gold (i.e. author/institution pays), is predicated on a pragmatic negotiation of copyright with academic publishers and, along with high-profile initiatives like SHERPA RoMEO, has encouraged growth in the number of RoMEO green journals which allow self-archiving, and therefore in the potential number of OA papers that can be self-archived in repositories. UKCoRR is concerned that the new policy will potentially impact on this strategy, and on repositories, which will have to assert CC-BY on relevant papers after a maximum embargo period of 6 months.

It is not clear what effect the policy may have on journals defined as ‘green’ by SHERPA RoMEO, or on the application of embargoes. Journals may choose to become RCUK-compliant, to remain RoMEO green but not compliant, or to reject green and compliance altogether, with individual publishers’ policies likely to be informed by the potential loss of revenue resulting from a universal 6-month embargo and by the extent to which other funders follow RCUK's lead on CC-BY.

HE institutions are increasingly adopting policies on OA and self-archiving in their repositories, and UKCoRR has some concerns that the policy should more explicitly consider the implications for this Green route to Open Access.

Limitation on Publication Destination
While UKCoRR acknowledges the value of this part of the policy, we note that it is likely to cause anxiety among academic stakeholders unless backed up by clear support mechanisms. If our members’ author stakeholders are made to feel more constrained in their choice of where to publish [p3], it could generate an adverse perception of, and consequent resistance towards, the implementation of Open Access policies within their organisation – already an issue that many of our members encounter in the course of their work. However, we look forward with interest to the impact this may have on those publishers who to date have been reluctant to embrace more Open Access-compliant author copyright licence agreements.

Impact on Publisher Policy
The policy would require a significant change in many publishers’ policies, and would limit the choice of journals to those that are recognised as compliant. Sources of funding for Open Access compliance will therefore become a more pressing issue across many institutions, where few currently have resources set aside to cope with such moves [p5]. We would encourage RCUK to be more proactive in ensuring that all researchers, especially PIs, and research administrators are made fully aware that this is part of the indirect costs allocated within the grant application and award. Clarity on RCUK’s stance in this regard would be especially welcomed by our membership.

Access to Data Outputs
The final paragraph in section 8 says "The underlying research materials do not necessarily have to be made Open Access" [p6]. While UKCoRR understands that mandating open data publication is out of the scope of this policy, we would welcome an approach that encourages this. There is an opportunity here to get more open data for use and reuse by the research community while respecting the normal provisos about commercial sensitivity, patient confidentiality etc.

Compliance Mechanisms
Our membership would also be interested in knowing more about how the “mechanisms to help ensure compliance” will operate [p6]. Given the time scales and workloads already placed upon our members, greater clarity in this regard and at the earliest opportunity will make such mechanisms more practicable for us to implement.

OA Fields and ROS
Given the timing of this, we would also query the lack of proper OA fields built into the recently launched Research Outcomes System (ROS). There are fields for SHERPA/RoMEO data and preprint/postprint, but nothing to clarify whether an OA version actually exists. While we are aware of ongoing discussions with the JISC, to which UKCoRR has also been party, we would welcome this being highlighted in the policy.

Final Comments
Overall, as strong advocates of the cultural, economic and intellectual benefits of Open Access to the UK’s unique research outputs, UKCoRR keenly supports this bold policy.

We welcome any opportunity to enter into a dialogue with the RCUK as to how this could be cascaded, both to our membership and our academic stakeholders.

Thursday 22 March 2012

Using Google Analytics Statistics within DSpace


Thank you to Claire Knowles of Edinburgh University who provides this overview of how they have been able to display statistics from Google Analytics in DSpace.
----
In 2009 Edinburgh University Digital Library adopted Google Analytics (GA) to track usage statistics within the DSpace repositories it supports on behalf of the Scottish Digital Library Consortium (SDLC). The GA statistics have proven much more reliable than the plugins previously available for DSpace, with which we experienced lost statistics and pageviews inflated by robots.

Unfortunately the GA statistics for tracked sites are only viewable via the GA dashboard, for which users require a Google account and managed permissions. This limits the visibility of statistics to a few people at each institution. Prompted by the presentation given by Graham Triggs (then working for BioMed Central) at the Open Repositories Conference 2010, we decided to write some code to make the Google Analytics statistics visible to all users of the DSpace installations.

The work has been broken into phases:

1. Capture of downloads in DSpace by Google Analytics. 
The basic GA tracking code within DSpace is unable to capture the number of file downloads, because the files themselves are not pages that execute the tracking code. To address this we added event-tracking code to the two download links on the item page so that download actions could be measured. This captured all downloads initiated within DSpace, but not those by users coming directly from search engines to the file itself. To capture these statistics we decided to reroute all users back through the item page. This means that they now need two clicks instead of one to reach the download, but it enables us to capture the statistics and also raises the visibility of the repository to users. To reduce the inconvenience we moved the file download links on the item page from the bottom to the top, so that users do not have to scroll down to find them.
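A rough sketch of the kind of event-tracking call involved is shown below. It assumes the classic asynchronous ga.js command queue, which was current at the time; the category, action and label names (and the file name) are purely illustrative, not the actual DSpace patch.

```javascript
// Classic asynchronous (ga.js-era) Google Analytics command queue.
var _gaq = _gaq || [];

// Record a bitstream download as a GA event. In DSpace this would be
// wired to each download link on the item page, e.g.:
//   <a href="/bitstream/.../thesis.pdf"
//      onclick="trackDownload('thesis.pdf')">Download</a>
function trackDownload(fileName) {
  _gaq.push(['_trackEvent', 'Bitstream', 'Download', fileName]);
}

// Simulate a click on a download link.
trackDownload('thesis.pdf');
```

Because downloads are pushed as events rather than pageviews, they can be reported separately from ordinary item-page traffic.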

2. Adding page views to each item page within DSpace
Secondly, we added the number of page views within the last year to each item page. This was a proof of concept which showed that we could connect to the Google Analytics API and pull statistics back into DSpace. We decided to include only the number of views for the past year, to reduce disparities in pageview counts between older and newer items.

3. Making statistics viewable within the DSpace web pages. 
We decided to make the GA statistics available at three levels – item, collection and repository – as this provides most of the statistics requested by users. Using the Query Explorer provided by Google we were able to test and refine our queries before starting development. The pages were developed using the Google Analytics Java API, jQuery and the Google Chart Tools to draw graphs and maps.
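The shape of such a query, as one might prototype it in the Query Explorer, could look like the following. All of the values here are placeholders for illustration, not Edinburgh's actual configuration.

```javascript
// Illustrative Core Reporting-style query: pageviews per page over the
// past year, restricted to DSpace handle pages. Profile id, dates and
// filter are placeholder values only.
var itemPageviewsQuery = {
  ids: 'ga:12345678',                // GA profile (view) id -- placeholder
  'start-date': '2011-03-22',
  'end-date': '2012-03-22',
  metrics: 'ga:pageviews',           // what to count
  dimensions: 'ga:pagePath',         // break down by page URL
  filters: 'ga:pagePath=~^/handle/', // regex match on DSpace handle pages
  sort: '-ga:pageviews'              // most-viewed first
};
```

The same query restricted to a single handle path gives the per-item figure, while dropping the filter gives repository-level totals.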





As we complete the rollout of Google Analytics to all the SDLC partners, we are starting to look at what other statistics we would like to make available, both from Google Analytics and also by possibly exposing statistical information about DSpace using Google's chart tools. One statistic that would be of interest to researchers is collated download figures for authors (rather than by item/collection/community).

We have encountered problems separating the item, collection and community statistics within DSpace, as all of their URLs are formatted in the same way; we therefore have to query DSpace data to tell them apart and cannot distinguish them using the statistics data alone. If the requested item, file, collection or community is not available in DSpace, an error page is returned. These were being recorded in the same way as successful pages, which led to invalid items being listed in the statistics top-ten tables. To prevent this, error pages are now recorded as an error event within Google Analytics.
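The disambiguation step can be sketched like this; the handle-to-type map stands in for a real lookup against the DSpace database, and all the handles are invented for the example.

```javascript
// DSpace items, collections and communities all live at
// /handle/<prefix>/<id>, so the type cannot be read off the URL alone;
// it has to come from DSpace's own data. This map stands in for a
// database lookup -- the handles here are invented.
var handleTypes = {
  '2160/1': 'community',
  '2160/42': 'collection',
  '2160/4242': 'item'
};

// Classify a request path; anything unknown is treated as an error,
// mirroring how 404s are now logged as GA error events rather than
// polluting the top-ten tables.
function classifyPath(path) {
  var m = /^\/handle\/([^\/]+\/[^\/]+)$/.exec(path);
  if (!m) return 'error';
  return handleTypes[m[1]] || 'error';
}
```

With the type resolved, pageview rows from GA can be aggregated into separate item, collection and community reports.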

These changes have given us a much greater understanding of how our repository is being used, with the majority of users coming directly from Google. The URL rewrite change led to a doubling of our download statistics, as we now capture users who previously went straight to the download.

Thanks to: the Scottish Digital Library Consortium; Stuart Wood and Gareth Johnson of the University of Leicester for information on the URL rewrite; and Graham Triggs, formerly of BioMed Central and now Symplectic.

The code to enable GA stats within DSpace is freely available from GitHub: https://github.com/seesmith/Dspace-googleanalytics

You can view our collection and item statistics changes at http://www.era.lib.ed.ac.uk

Are your repository policies worth the HTML they are written in?

In Neil Stewart's recent guest post on this blog he lamented The Unfulfilled Promise of Aggregating Institutional Repository Content; in the context of his work with the CORE projects at the Open University, Owen Stephens (@ostephens) commented on that post about "technological and policy barriers to a 3rd party aggregating content from UK HE IRs" and has subsequently posted in more detail over on the CORE blog.

Not to put too fine a point on it, I think Owen has identified issues that are fundamental to the potential value of our repository infrastructure in UK HE, at least in terms of responsible 3rd parties building services on top of that infrastructure. Owen also asks in the title of his post "What does Google do?", for which the short answer is that it indexes (harvests) metadata and full text for (arguably) commercial re-use unless asked not to by robots.txt. This is not necessarily to suggest that Google is irresponsible – it may well be, but that is a rather bigger discussion!

For CORE, by comparison, it has understandably been important to establish individual repository policies on re-use of metadata and full-text content. Where such policies exist at all they are invariably designed to be human-readable rather than machine-readable, which is obviously not conducive to automated harvest, in spite of guidance being available on how to handle record-, set- and repository-level rights statements in OAI-PMH at http://www.openarchives.org/OAI/2.0/guidelines-rights.htm.

To quote from Owen's review of policies listed in OpenDOAR: "Looking at the 'metadata' policy summaries that OpenDOAR has recorded for these 125 repositories the majority (57) say "Metadata re-use policy explicitly undefined" which seems to sometimes mean OpenDOAR doesn't have a record of a metadata re-use policy, and sometimes seems to mean that OpenDOAR knows that there is no explicit metadata re-use policy defined by the repository. Of the remaining repositories, for a large proportion (47) OpenDOAR records "Metadata re-use permitted for not-for-profit purposes", and for a further 18 "Commercial metadata re-use permitted"."

It might be suggested that machine-readability is actually secondary to policy that is misconceived in the first place – or that hasn't been fully thought through and is, at the very least, fatally fragmented across the sector – policy that is arguably the result of lip-service rather than of what actually happens in the real (virtual) world.

For my own part, in my institutional role, I was very, er, green (no pun intended) when I defined our repository policies back in 2008 using the OpenDOAR policy creation toolkit - http://www.opendoar.org/tools/en/policies.php - and to be frank I haven't really revisited them since. I suspect I'm not terribly unusual. To quote Owen once more: "the situation is even less clear for fulltext content than it is for metadata. OpenDOAR lists 54 repositories with the policy summary "Full data item policies explicitly undefined", but after that the next most common (29 repositories) policy summary (as recorded by OpenDOAR) is "Rights vary for the re-use of full data items" - more on this in a moment. OpenDOAR records "Re-use of full data items permitted for not-for-profit purposes" for a further 20 repositories, and then (of particular interest for CORE) 16 repositories as "Harvesting full data items by robots prohibited"."

The (reasonably unrestrictive) metadata and full-text policies I chose at Leeds Metropolitan University state that "the metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided the OAI Identifier or a link to the original metadata record are given" and "copies of full items generally can be reproduced, displayed or performed, and given to third parties in any format or medium for personal research or study, educational, or not-for-profit purposes without prior permission or charge". Even this, with the word "generally" implicitly recognises the fact that there may be different restrictions that apply to different items which to some extent reflects the complexity of negotiating copyright for green OA, not to mention the other types of records that repositories may hold (e.g. our repository also comprises a collection of Open Educational Resources [OER] which are in fact licensed at the record level with a Creative Commons URI in dc:rights as in this example - http://repository-intralibrary.leedsmet.ac.uk/IntraLibrary-OAI?verb=GetRecord&identifier=oai:com.intralibrary.leedsmet:2711&metadataPrefix=oai_dc)
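For illustration, a record-level rights statement of the kind used for those OER records looks something like the oai_dc fragment below; the title and licence URI are invented for the example, but the dc:rights mechanism is standard.

```xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example open educational resource</dc:title>
  <!-- Machine-readable licence: a Creative Commons URI in dc:rights -->
  <dc:rights>http://creativecommons.org/licenses/by-nc-sa/2.0/uk/</dc:rights>
</oai_dc:dc>
```

A harvester that understands licence URIs can act on this automatically, which is exactly what a free-text, human-readable policy page does not allow.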

Nor are my policies available in a machine-readable form (which, as we've established, is typical across the sector), and I'm not actually sure how this could even be achieved without applying a standard licence like Creative Commons.

Owen goes on to consider "What does Google do?". If you haven't already, it's certainly worth reading the post in full, but he concludes that "Google, Google Scholar, and other web search engines do not rely on the repository specific mechanisms to index their content, and do not take any notice of repository policies". Indeed, in common with many repository managers, I devote a lot of time and effort to SEO to ensure my repository is effectively indexed by Google et al. and that full-text content can be discovered by these global search engines... which seems somewhat perverse when our own parochial mechanisms and fragmented institutional policies make it so difficult to build effective services of our own.

Monday 19 March 2012

UKCoRR, RSP and DRF - Japan and the UK in Agreement

As you'll probably have seen last week on the lists, UKCoRR, in collaboration with the RSP and Japan's DRF (Digital Repository Federation), has signed a memorandum of understanding.

 
The Memorandum includes a commitment to
  • Sharing experience and expertise
  • Inviting and possibly sponsoring representatives from partners to participate in RSP and DRF events
  • Joint efforts to seek funding and/or support

Obviously, from UKCoRR's perspective (and being unfunded as we are) we're mostly about the first option in the agreement; but all the same, it's the first time we've signed up to an international agreement, and that is something all members can be proud of – the furtherance of recognition of the importance of the repository worker and manager around the world.

 
You can read more about this, and view the memorandum on the RSP's pages.