Thursday 22 March 2012

Using Google Analytics Statistics within DSpace


Thank you to Claire Knowles of Edinburgh University who provides this overview of how they have been able to display statistics from Google Analytics in DSpace.
----
In 2009 Edinburgh University Digital Library adopted Google Analytics (GA) to track usage statistics within the DSpace repositories it supports on behalf of the Scottish Digital Library Consortium (SDLC).  The GA statistics have proven much more reliable than the DSpace statistics plugins we used previously, with which we experienced lost statistics and pageview counts inflated by robots.

Unfortunately the GA statistics for tracked sites are only viewable via the GA dashboard, which requires a Google account and managed permissions.  This limits the visibility of statistics to a few people at each institution.  Prompted by the presentation given by Graham Triggs (then working for BioMed Central) at the Open Repositories Conference 2010, we decided to write some code to make the Google Analytics statistics visible to all users of the DSpace installations.

The work has been broken into phases:

1. Capture of downloads in DSpace by Google Analytics. 
The basic GA tracking code within DSpace is unable to capture the number of file downloads, because the files themselves are not HTML pages carrying the tracking snippet.  To address this we added code to the two download links on the item page to enable these download actions to be measured.  This captured all downloads made from within DSpace, but not those by users coming directly from search engines to the file itself.  To capture these statistics we decided to reroute all users back through the item page.  Users now need two clicks instead of one to reach the download, but this enables us to capture these statistics and also raises the visibility of the repository to users.  To reduce the inconvenience we moved the file download links on the item page from the bottom to the top, so that users do not have to scroll down to find them.
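
For illustration, with the classic asynchronous ga.js API a download click can be recorded by pushing an event when a file link is followed.  The sketch below is indicative only - the "/bitstream/" selector and the event names are assumptions, and our actual code is linked at the end of this post:

    // Assumes the standard asynchronous ga.js snippet has already defined _gaq,
    // and that DSpace file (bitstream) links contain "/bitstream/" in the URL -
    // both assumptions for illustration.
    jQuery(function ($) {
      $('a[href*="/bitstream/"]').click(function () {
        // Category, action and label names here are illustrative only.
        _gaq.push(['_trackEvent', 'Bitstream', 'Download', this.href]);
      });
    });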

2. Adding page views to each item page within DSpace
Secondly, we added the number of page views within the last year to each item page.  This was a proof of concept which showed that we could connect to the Google Analytics API and pull statistics back into DSpace.  We decided to include only the number of views for the past year, to reduce the disparity in pageview counts between older and newer items.
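
The page side of this can be sketched roughly as follows, assuming - purely for illustration, as the URL and response shape below are invented - that the repository webapp exposes the GA pageview count for an item at a local JSON endpoint:

    // Hypothetical endpoint: the DSpace webapp queries the Google Analytics
    // API server-side and returns something like {"pageviews": 123}.
    jQuery(function ($) {
      var handle = '1842/12345'; // the item's handle (illustrative value)
      $.getJSON('/statistics/pageviews?handle=' + handle, function (data) {
        $('#item-pageviews').text(data.pageviews + ' page views in the last year');
      });
    });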

3. Making statistics viewable within the DSpace web pages. 
We decided to make the GA statistics available at three levels - item, collection and repository - as this covers most of the statistics requested by users.  Using the Query Explorer provided by Google we were able to test and refine our queries before starting development.  The pages were developed using the Google Analytics Java API, jQuery and the Google Chart Tools, which draw the graphs and maps.
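
For example, a map of downloads by country can be drawn with the Chart Tools along these lines; the figures below are placeholders where our pages substitute values returned from the GA queries:

    // Assumes the Google JS API loader is included on the page:
    // <script src="https://www.google.com/jsapi"></script>
    google.load('visualization', '1', { packages: ['geochart'] });
    google.setOnLoadCallback(function () {
      var data = google.visualization.arrayToDataTable([
        ['Country', 'Downloads'],
        ['United Kingdom', 1200], // placeholder figures only
        ['United States', 800],
        ['Germany', 150]
      ]);
      new google.visualization.GeoChart(document.getElementById('download-map'))
        .draw(data, {});
    });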

As we complete the rollout of Google Analytics to all the SDLC partners, we are starting to look at what other statistics we would like to make available, both from Google Analytics and also possibly by exposing statistical information about DSpace using Google's Chart Tools.  One statistic that would interest researchers is collating and presenting download figures by author (rather than by item, collection or community).

We have encountered problems separating the item, collection and community statistics within DSpace, as all of their URLs are formatted in the same way; we therefore have to query DSpace data to tell them apart, and cannot distinguish them using the statistics data alone.  If a requested item, file, collection or community is not available in DSpace an error page is returned.  These error pages were being recorded in the same way as successful pages, which led to invalid items being listed in the statistics top ten tables.  To prevent this, error pages are now recorded as an error event within Google Analytics.
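
Recording the error as an event rather than a pageview is a single ga.js call on the error page; a sketch, with illustrative category and action names:

    // On the DSpace error page: log the failed request as an event, so that it
    // stays out of the pageview-based top ten tables. Names are illustrative.
    _gaq.push(['_trackEvent', 'Error', 'Page Not Found', window.location.pathname]);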

These changes have given us a much greater understanding of how our repository is being used, with the majority of users coming directly from Google.  The URL rewrite change led to a doubling of our download statistics, as we now capture users who previously went straight to the download.

Thanks to: the Scottish Digital Library Consortium; Stuart Wood and Gareth Johnson of the University of Leicester for information on the URL rewrite; and Graham Triggs, formerly of BioMed Central and now Symplectic.

The code to enable GA statistics within DSpace is freely available from GitHub: https://github.com/seesmith/Dspace-googleanalytics

You can view our collection and item statistics changes at http://www.era.lib.ed.ac.uk

Are your repository policies worth the HTML they are written in?

In Neil Stewart's recent guest post on this blog he lamented The Unfulfilled Promise of Aggregating Institutional Repository Content.  In the context of his work with the CORE projects at the Open University, Owen Stephens (@ostephens) commented on that post about "technological and policy barriers to a 3rd party aggregating content from UK HE IRs" and has subsequently posted in more detail over on the CORE blog.

Not to put too fine a point on it, I think Owen has identified issues that are fundamental to the potential value of our repository infrastructure in UK HE, at least in terms of responsible 3rd parties building services on top of that infrastructure.  Owen also asks in the title of his post "What does Google do?", for which the short answer is that it indexes (harvests) metadata and full text for (arguably) commercial re-use unless asked not to by robots.txt.  This is not necessarily to suggest that Google is irresponsible; it may well be, but that is a rather bigger discussion!
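
The opt-out mechanism really is that blunt: ordinary robots.txt rules.  A hypothetical fragment (the path is invented) that withheld full text from Google's crawler would read:

    # Hypothetical robots.txt fragment: keep Google's crawler away from the
    # full-text (bitstream) URLs while leaving other pages indexable.
    User-agent: Googlebot
    Disallow: /bitstream/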

For CORE, by comparison, it has understandably been important to establish individual repository policy on re-use of metadata and full-text content.  Where such policies exist at all they are invariably designed to be human readable rather than machine readable, which is obviously not conducive to automated harvesting, in spite of guidance being available on how to handle record, set and repository level rights statements in OAI-PMH at http://www.openarchives.org/OAI/2.0/guidelines-rights.htm.

In his review of policies listed in OpenDOAR, Owen found that "Looking at the 'metadata' policy summaries that OpenDOAR has recorded for these 125 repositories the majority (57) say "Metadata re-use policy explicitly undefined" which seems to sometimes mean OpenDOAR doesn't have a record of a metadata re-use policy, and sometimes seems to mean that OpenDOAR knows that there is no explicit metadata re-use policy defined by the repository. Of the remaining repositories, for a large proportion (47) OpenDOAR records "Metadata re-use permitted for not-for-profit purposes", and for a further 18 "Commercial metadata re-use permitted"."

It might be suggested that machine-readability is actually secondary to policy that is potentially misconceived in the first place - or that has perhaps not been fully thought through, and at the very least is fatally fragmented across the sector - policy that arguably reflects lip-service rather than what actually happens in the real (virtual) world.

For my own part, in my institutional role, I was very, er, green (no pun intended) when I defined our repository policies back in 2008 using the OpenDOAR policy creation toolkit - http://www.opendoar.org/tools/en/policies.php - and to be frank I haven't really revisited them since. I suspect I'm not terribly unusual. To quote Owen once more, "the situation is even less clear for fulltext content than it is for metadata. OpenDOAR lists 54 repositories with the policy summary "Full data item policies explicitly undefined", but after that the next most common (29 repositories) policy summary (as recorded by OpenDOAR) is "Rights vary for the re-use of full data items" - more on this in a moment. OpenDOAR records "Re-use of full data items permitted for not-for-profit purposes" for a further 20 repositories, and then (of particular interest for CORE) 16 repositories as "Harvesting full data items by robots prohibited"."

The (reasonably unrestrictive) metadata and full-text policies I chose at Leeds Metropolitan University state that "the metadata may be re-used in any medium without prior permission for not-for-profit purposes and re-sold commercially provided the OAI Identifier or a link to the original metadata record are given" and "copies of full items generally can be reproduced, displayed or performed, and given to third parties in any format or medium for personal research or study, educational, or not-for-profit purposes without prior permission or charge". Even this, with the word "generally", implicitly recognises that different restrictions may apply to different items, which to some extent reflects the complexity of negotiating copyright for green OA - not to mention the other types of records that repositories may hold (e.g. our repository also comprises a collection of Open Educational Resources [OER], which are in fact licensed at the record level with a Creative Commons URI in dc:rights, as in this example - http://repository-intralibrary.leedsmet.ac.uk/IntraLibrary-OAI?verb=GetRecord&identifier=oai:com.intralibrary.leedsmet:2711&metadataPrefix=oai_dc).
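
For illustration, record-level licensing of this kind simply puts the license URI in the record's Dublin Core metadata; a sketch of the relevant part of such an oai_dc record, with invented values and an example license URI:

    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example open educational resource</dc:title>
      <dc:creator>Example, Author</dc:creator>
      <!-- the machine-readable license: a Creative Commons URI in dc:rights -->
      <dc:rights>http://creativecommons.org/licenses/by-nc-sa/2.0/uk/</dc:rights>
    </oai_dc:dc>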

Nor are my policies available in a machine-readable form (which, as we've established, is typical across the sector), and I'm not actually sure how this could even be achieved without applying a standard license like Creative Commons?

Owen goes on to consider "What does Google do?" - if you haven't already, it's certainly worth reading the post in full - but he concludes that "Google, Google Scholar, and other web search engines do not rely on the repository specific mechanisms to index their content, and do not take any notice of repository policies". Indeed, in common with many repository managers, I devote a lot of time and effort to SEO to ensure my repository is effectively indexed by Google et al and that full-text content can be discovered by these global search engines... which seems somewhat perverse when our own parochial mechanisms and fragmented institutional policies make it so difficult to build effective services of our own.

Monday 19 March 2012

UKCoRR, RSP and DRF - Japan and the UK in Agreement

As you'll probably have seen on the lists last week, UKCoRR, in collaboration with the RSP and Japan's DRF (Digital Repository Federation), has signed a memorandum of understanding.

The Memorandum includes a commitment to:
  • Sharing experience and expertise
  • Inviting and possibly sponsoring representatives from partners to participate in RSP and DRF events
  • Joint efforts to seek funding and/or support

Obviously from UKCoRR's perspective (unfunded as we are) we're mostly about the first item in the agreement; but all the same it's the first time we've signed up to an international agreement, and that is something all members can be proud of - the furtherance of recognition of the importance of the repository worker and manager around the world.

You can read more about this, and view the memorandum, on the RSP's pages.

Tuesday 6 March 2012

The Unfulfilled Promise of Aggregating Institutional Repository Content (Guest Post)

Our thanks to Neil Stewart, Digital Repository Manager at City University London, for the following guest post, which raises some interesting questions for us all.
----
A very good question was posed on Stephen Curry’s blog by Björn Brembs recently (Curry and Brembs are a couple of the more prominent figures supporting the Elsevier boycott):
I’ve always wondered why the institutional repositories aren’t working with, e.g. PubMed etc. to make sure a link to their version is displayed with the search results. I mean, how difficult can this be?
This got me thinking: how difficult can it be? Aggregating and re-using institutional repository (IR) content at subject level is, after all, one of the promises of the Green road to Open Access.
The infrastructure is already in place, in the form of the many OAI-PMH compliant institutional repositories out there, and there is also the SWORD protocol, which allows flexible transfer of repository content. Some examples do exist - for example, the Economists Online service, which harvests material from selected economics research-intensive universities, then makes it available via a portal. But (to my knowledge) there has been no work done to provide a way of e.g. ensuring all a repository’s eligible physics content is automatically uploaded to ArXiv, or all biomedical research to UKPMC.
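To give a sense of how standard this plumbing is, an aggregator could in principle pull a repository’s physics holdings with a single OAI-PMH request along these lines (the base URL and set name here are invented):

    http://repository.example.ac.uk/oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=physics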
Subject repositories have gained critical mass in certain disciplines (to add to the examples above, see also RePEc for economics, SSRN for the social sciences and DBLP for computer science), meaning that if a paper doesn’t appear there, it’s far less visible. This means that the incentive to post locally is greatly reduced - yes, your paper will appear in Google, but a paper in ArXiv will appear both in Google and in the native interface of the repository where everyone else in your discipline is depositing.
So if the infrastructure is there and the rationale to create these links exists, why has it not been happening to any meaningful extent already? I suspect it’s because the IR landscape is, by its very nature, a fragmented one. Those with responsibility for IRs (managers, IT people, and senior management) are understandably concerned with local issues: ensuring that IRs are properly managed and integrated with the university’s systems, as well as the usual open access and service awareness-raising and advocacy. Having time to think about the automatic population of ArXiv with papers from your home repository is probably pretty far down one’s to-do list.
That’s not to say that repository managers are oblivious to these issues - but here another problem arises. Few individual repository managers, I would guess, would think that they individually could negotiate with and persuade ArXiv that automatic harvesting of physics content from their repository, and their repository alone, would be worth ArXiv’s while. This is, perhaps, where UKCoRR (or other national bodies - JISC perhaps?) might come in. If ArXiv or similar subject repositories could be persuaded of the merits of harvesting IR content (whether full text or metadata pointing back to IR holdings), it would allow all repositories to plug in to this system, and offer it as a service to academics (two-for-one deposit: local IR and ArXiv at the same time!).
So, what do people think? Is there any appetite for turning this into a project that UKCoRR members could take forward, perhaps with UKCoRR and/or JISC oversight? Comments please!