@aschweer
Last active March 24, 2016 03:55
Extracting CAUL IRR stats from a DSpace repository

Example code to extract statistics from a DSpace repository according to the CAUL IRR stats pilot:

  1. Number of complete works (1) held in the institutional repository/ies, available externally (open access)
  2. Number of complete works (2) held in the institutional repository/ies for internal access only (authorized user access or dark archive)
  3. Total number of complete works (3) held in the IR (1 and 2)
  4. Number of items (4) held in the institutional repository/ies (metadata records only)
  5. Total number of metadata records added during the year
  6. Total number of complete works added during the year
  7. Total number of items held in the institutional repository/ies (3 and 4)
  8. Number of accesses to complete works in the institutional repository during the year
  9. Number of accesses to metadata record items in the institutional repository during the year
  10. Total number of accesses to items held in the institutional repository (8 and 9).

This code was developed for the Library Consortium of New Zealand's IRRs.

How this works / limitations

This code retrieves all "number of" statistics from the DSpace database. It queries the Solr-based statistics to get usage data for each item identified as relevant. DSpace policies for items, bundles and bitstreams are consulted to determine whether an item counts as a "complete work" or a "metadata record": an item with at least one publicly readable bitstream is a complete work, and any other public item is a metadata record. The repositories for which this code was written don't hold any "complete works for internal access only", so the code does not attempt to detect those. See the comments in IRRStatsController for a starting point to adapt this for other repositories.
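The classification described above can be sketched without any DSpace dependencies. This is a simplified illustration, not the code from this gist: `SimpleItem`, `Kind` and `groups` are hypothetical stand-ins for DSpace items and their READ policies, and the archived/withdrawn checks are left out. The one assumption carried over from `isPublic` in IRRStatsController is that group ID 0 is the Anonymous group.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClassificationSketch {
    // Assumption carried over from isPublic() in IRRStatsController:
    // group ID 0 is the Anonymous group.
    static final int ANONYMOUS_GROUP_ID = 0;

    enum Kind { PUBLIC_FULLTEXT, METADATA_ONLY, NOT_COUNTED }

    /** Hypothetical stand-in for an item and the READ groups on it and on its files. */
    static final class SimpleItem {
        final Set<Integer> itemReadGroups;
        final List<Set<Integer>> fileReadGroups;

        SimpleItem(Set<Integer> itemReadGroups, List<Set<Integer>> fileReadGroups) {
            this.itemReadGroups = itemReadGroups;
            this.fileReadGroups = fileReadGroups;
        }
    }

    static boolean isPublic(Set<Integer> readGroups) {
        return readGroups.contains(ANONYMOUS_GROUP_ID);
    }

    static Kind classify(SimpleItem item) {
        if (!isPublic(item.itemReadGroups)) {
            return Kind.NOT_COUNTED; // internal-only items are not handled
        }
        for (Set<Integer> groups : item.fileReadGroups) {
            if (isPublic(groups)) {
                return Kind.PUBLIC_FULLTEXT; // at least one anonymously readable file
            }
        }
        return Kind.METADATA_ONLY; // no files, or none readable anonymously
    }

    static Set<Integer> groups(Integer... ids) {
        return new HashSet<Integer>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        // Group 5 is an arbitrary made-up restricted group.
        SimpleItem openAccess = new SimpleItem(groups(0), Collections.singletonList(groups(0)));
        SimpleItem restrictedFile = new SimpleItem(groups(0), Collections.singletonList(groups(5)));
        SimpleItem noFiles = new SimpleItem(groups(0), Collections.<Set<Integer>>emptyList());
        System.out.println(classify(openAccess));
        System.out.println(classify(restrictedFile));
        System.out.println(classify(noFiles));
    }
}
```

Note that a public item whose files are all restricted ends up as a metadata record here, which mirrors what the gist's code does; a repository that holds internal-only full text would need a third branch at that point.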

The original intention was to write a DSpace aspect so this could be integrated into the administrative user interface. However, given that the IRR stats collection is still in a pilot phase, it was decided to stick to a version for now that needs to be run on the command line on the DSpace server.

Instructions

Save the Java files in a directory structure corresponding to the Java package (nz/ac/lconz/irr/dspace/app/xmlui/aspect/irrstats/). Compile the files against the DSpace libraries and make the resulting classes available on the DSpace classpath, e.g. by creating a jar file and placing it into [dspace]/lib. Then run IRRStatsCreator from the command line via the dsrun command:

[dspace]/bin/dspace dsrun nz.ac.lconz.irr.dspace.app.xmlui.aspect.irrstats.IRRStatsCreator fromDate toDate filename

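where fromDate and toDate are given as yyyy-MM-dd (the start date is inclusive, the end date exclusive) and filename is the name of the output file. This should probably be preceded by cleaning up bot data from the Solr statistics ([dspace]/bin/dspace stats-util -u && [dspace]/bin/dspace stats-util -i).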

The output will be in CSV format as follows:

CountPublicFulltext,n
CountInternalFulltext,n
CountAnyFulltext,n
CountMetadataOnly,n
AddedMetadataOnly,n
AddedAnyFulltext,n
CountAll,n
AccessAnyFulltext,n
AccessMetadataOnly,n
AccessAll,n

The order of the metrics corresponds to that in the list above, which in turn is quoted from the IRR stats PDF document distributed by CAUL.
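Because three of the ten metrics are sums of others (3 = 1 + 2, 7 = 3 + 4, 10 = 8 + 9), an output file can be sanity-checked after the run. The following is a minimal sketch of such a check, not part of the gist: the class name `IrrCsvCheck` and its methods are hypothetical, and the numbers in `main` are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IrrCsvCheck {
    /** Parses "MetricName,n" lines into an ordered map; blank lines are skipped. */
    static Map<String, Long> parse(String csv) {
        Map<String, Long> values = new LinkedHashMap<String, Long>();
        for (String line : csv.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split(",", 2);
            values.put(parts[0], Long.parseLong(parts[1].trim()));
        }
        return values;
    }

    /** Returns the value for a metric, or 0 if the metric is missing. */
    static long get(Map<String, Long> values, String metric) {
        Long value = values.get(metric);
        return value == null ? 0L : value.longValue();
    }

    /** Checks the three derived totals against their constituents. */
    static boolean totalsConsistent(Map<String, Long> v) {
        return get(v, "CountAnyFulltext") == get(v, "CountPublicFulltext") + get(v, "CountInternalFulltext")
            && get(v, "CountAll") == get(v, "CountAnyFulltext") + get(v, "CountMetadataOnly")
            && get(v, "AccessAll") == get(v, "AccessAnyFulltext") + get(v, "AccessMetadataOnly");
    }

    public static void main(String[] args) {
        // Made-up sample values in the documented order.
        String sample = "CountPublicFulltext,120\n"
            + "CountInternalFulltext,0\n"
            + "CountAnyFulltext,120\n"
            + "CountMetadataOnly,30\n"
            + "AddedMetadataOnly,5\n"
            + "AddedAnyFulltext,12\n"
            + "CountAll,150\n"
            + "AccessAnyFulltext,4000\n"
            + "AccessMetadataOnly,250\n"
            + "AccessAll,4250\n";
        System.out.println("Totals consistent: " + totalsConsistent(parse(sample)));
    }
}
```

In real use the string would be read from the file passed as filename; the sketch keeps the data inline so it runs on its own.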

package nz.ac.lconz.irr.dspace.app.xmlui.aspect.irrstats;

import org.apache.log4j.Logger;
import org.apache.solr.client.solrj.SolrServerException;
import org.dspace.authorize.AuthorizeManager;
import org.dspace.content.*;
import org.dspace.core.Constants;
import org.dspace.core.Context;
import org.dspace.eperson.Group;
import org.dspace.statistics.ObjectCount;
import org.dspace.statistics.SolrLogger;

import java.sql.SQLException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

/**
 * @author Andrea Schweer schweer@waikato.ac.nz for the LCoNZ Institutional Research Repositories
 */
class IRRStatsController {
    private final static Logger log = Logger.getLogger(IRRStatsController.class);
    protected static final DateFormat QUERY_FORMAT = new SimpleDateFormat(SolrLogger.DATE_FORMAT_8601);
    private Map<Metric, Long> values = new HashMap<Metric, Long>();

    long getValueFor(Metric metric) throws IllegalStateException {
        return values.containsKey(metric) ? values.get(metric) : 0L;
    }
    void gatherData(Context context, Date startDate, Date endDate) throws StatsDataException {
        try {
            ItemIterator items = Item.findAll(context);
            while (items.hasNext()) {
                Item item = items.next();
                try {
                    if (item.isArchived() && !item.isWithdrawn() && isPublic(context, item)) { // the LCoNZ IRRs don't have internal-only items
                        if (addedBy(endDate, item)) {
                            boolean addedInPeriod = addedSince(startDate, item);
                            if (hasPublicFulltext(context, item)) {
                                incrementValue(Metric.CountPublicFulltext);
                                addToValue(Metric.AccessAnyFulltext, countDownloads(context, item, startDate, endDate));
                                if (addedInPeriod) {
                                    incrementValue(Metric.AddedAnyFulltext);
                                }
                            } else if (hasInternalFulltext(context, item)) {
                                incrementValue(Metric.CountInternalFulltext);
                                addToValue(Metric.AccessAnyFulltext, countDownloads(context, item, startDate, endDate));
                                if (addedInPeriod) {
                                    incrementValue(Metric.AddedAnyFulltext);
                                }
                            } else if (isMetadataOnly(context, item)) {
                                incrementValue(Metric.CountMetadataOnly);
                                addToValue(Metric.AccessMetadataOnly, countPageViews(context, item, startDate, endDate));
                                if (addedInPeriod) {
                                    incrementValue(Metric.AddedMetadataOnly);
                                }
                            }
                        }
                    }
                } catch (RuntimeException e) {
                    // pass the exception along so the log shows why the item was skipped
                    log.error("Problem encountered with item " + item.getID() + ", not counting it", e);
                } catch (SolrServerException e) {
                    log.error("Problem encountered with item " + item.getID() + ", not counting it", e);
                }
                item.decache();
            }
            // derive the totals (metrics 3, 7 and 10) from their constituents
            sumUpValues(Metric.CountAnyFulltext, Metric.CountPublicFulltext, Metric.CountInternalFulltext);
            sumUpValues(Metric.CountAll, Metric.CountAnyFulltext, Metric.CountMetadataOnly);
            sumUpValues(Metric.AccessAll, Metric.AccessAnyFulltext, Metric.AccessMetadataOnly);
        } catch (SQLException e) {
            e.printStackTrace();
            throw new StatsDataException(e);
        }
    }
    private void sumUpValues(Metric targetMetric, Metric... constituents) {
        long sum = 0L;
        for (Metric constituent : constituents) {
            if (values.containsKey(constituent)) {
                Long value = values.get(constituent);
                if (value != null) {
                    sum += value.longValue();
                }
            }
        }
        values.put(targetMetric, sum);
    }

    private boolean hasPublicFulltext(Context context, Item item) throws SQLException {
        Bitstream[] bitstreams = item.getNonInternalBitstreams();
        for (Bitstream bitstream : bitstreams) {
            if (isPublic(context, bitstream)) {
                return true;
            }
        }
        return false;
    }

    private boolean hasInternalFulltext(Context context, Item item) {
        return false; // hard-coded for LCoNZ IRRs, which don't have items with internal-only fulltext
    }

    private boolean isMetadataOnly(Context context, Item item) throws SQLException {
        Bitstream[] bitstreams = item.getNonInternalBitstreams();
        for (Bitstream bitstream : bitstreams) {
            if (isPublic(context, bitstream)) {
                return false;
            }
        }
        return true;
    }

    private long countDownloads(Context context, Item item, Date startDate, Date endDate) throws SolrServerException {
        StringBuilder query = new StringBuilder("type:");
        query.append(Constants.BITSTREAM);
        query.append(" AND owningItem:");
        query.append(item.getID());
        query.append(" AND time:[");
        query.append(QUERY_FORMAT.format(startDate));
        query.append(" TO ");
        query.append(QUERY_FORMAT.format(endDate));
        query.append("]");
        ObjectCount downloads = SolrLogger.queryTotal(query.toString(), "-isBot:true");
        return downloads.getCount();
    }

    private long countPageViews(Context context, Item item, Date startDate, Date endDate) throws SolrServerException {
        StringBuilder query = new StringBuilder("type:");
        query.append(Constants.ITEM);
        query.append(" AND id:");
        query.append(item.getID());
        query.append(" AND time:[");
        query.append(QUERY_FORMAT.format(startDate));
        query.append(" TO ");
        query.append(QUERY_FORMAT.format(endDate));
        query.append("]");
        ObjectCount pageViews = SolrLogger.queryTotal(query.toString(), "-isBot:true");
        return pageViews.getCount();
    }

    private void incrementValue(Metric metric) {
        addToValue(metric, 1L);
    }

    private void addToValue(Metric metric, long amount) {
        long existing = 0L;
        if (values.containsKey(metric)) {
            Long value = values.get(metric);
            if (value != null) {
                existing = value.longValue();
            }
        }
        values.put(metric, existing + amount);
    }

    private boolean addedBy(Date endDate, Item item) {
        DCValue[] accessioned = item.getMetadata("dc", "date", "accessioned", Item.ANY);
        try {
            Date accessionDate = new DCDate(accessioned[0].value).toDate();
            return !accessionDate.after(endDate);
        } catch (RuntimeException e) {
            return false;
        }
    }

    private boolean addedSince(Date startDate, Item item) {
        DCValue[] accessioned = item.getMetadata("dc", "date", "accessioned", Item.ANY);
        try {
            Date accessionDate = new DCDate(accessioned[0].value).toDate();
            return !accessionDate.before(startDate);
        } catch (RuntimeException e) {
            return false;
        }
    }
    private boolean isPublic(Context context, DSpaceObject dso) throws SQLException {
        Group[] readGroups = AuthorizeManager.getAuthorizedGroups(context, dso, Constants.READ);
        for (Group group : readGroups) {
            if (group.getID() == 0) { // group 0 is the Anonymous group, i.e. readable by everyone
                return true;
            }
        }
        return false;
    }
}
package nz.ac.lconz.irr.dspace.app.xmlui.aspect.irrstats;

import org.dspace.core.Context;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.sql.SQLException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.Locale;

/**
 * @author Andrea Schweer schweer@waikato.ac.nz for the LCoNZ Institutional Research Repositories
 */
public class IRRStatsCreator {
    public static void main(String[] args) throws SQLException, IOException {
        if (args.length < 3) {
            System.out.println("Usage: IRRStatsCreator fromDate toDate filename");
            System.out.println(" Dates given as yyyy-MM-dd; start date is inclusive but end date is exclusive (e.g. for all of 2012, specify 2012-01-01 2013-01-01)");
            return;
        }
        Date startDate;
        Date endDate;
        try {
            SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
            dateFormat.setTimeZone(Calendar.getInstance().getTimeZone());
            startDate = dateFormat.parse(args[0]);
            endDate = dateFormat.parse(args[1]);
            System.out.println("Using start date " + dateFormat.format(startDate));
            System.out.println("Using end date " + dateFormat.format(endDate));
        } catch (ParseException e) {
            e.printStackTrace(System.err);
            return;
        }
        Context context = null;
        try {
            context = new Context();
            context.turnOffAuthorisationSystem();
            IRRStatsController controller = new IRRStatsController();
            controller.gatherData(context, startDate, endDate);
            // write one "MetricName,value" line per metric, in enum declaration order
            BufferedWriter writer = new BufferedWriter(new FileWriter(args[2]));
            for (Metric metric : Metric.values()) {
                writer.write(metric.name());
                writer.write(",");
                writer.write(Long.toString(controller.getValueFor(metric)));
                writer.write("\n");
            }
            writer.flush();
            writer.close();
        } catch (StatsDataException e) {
            e.printStackTrace(System.err);
        } finally {
            // nothing was modified, so discard the context rather than committing it
            if (context != null) {
                context.abort();
            }
        }
    }
}
package nz.ac.lconz.irr.dspace.app.xmlui.aspect.irrstats;

public enum Metric {
    CountPublicFulltext,
    CountInternalFulltext,
    CountAnyFulltext,
    CountMetadataOnly,
    AddedMetadataOnly,
    AddedAnyFulltext,
    CountAll,
    AccessAnyFulltext,
    AccessMetadataOnly,
    AccessAll
}
package nz.ac.lconz.irr.dspace.app.xmlui.aspect.irrstats;

import java.sql.SQLException;

/**
 * @author Andrea Schweer schweer@waikato.ac.nz for the LCoNZ Institutional Research Repositories
 */
public class StatsDataException extends Exception { // checked exception; Exception rather than Throwable is the conventional base class
    public StatsDataException(SQLException cause) {
        super(cause);
    }
}
@garybrowne

Hi Andrea,
I'd like to give this a try - but not sure how to compile - where should I place the package within the DSpace source hierarchy? If I try to compile outside of DSpace it cannot find all the package dependencies. eg:

gary@garylinux:~/Downloads/caul-dspace-stats/nz/ac/lconz/irr/dspace/app/xmlui/aspect/irrstats$ javac -g *.java
IRRStatsController.java:3: error: package org.apache.log4j does not exist
import org.apache.log4j.Logger;
                       ^
IRRStatsController.java:4: error: package org.apache.solr.client.solrj does not exist
import org.apache.solr.client.solrj.SolrServerException;
                                   ^
IRRStatsController.java:5: error: package org.dspace.authorize does not exist
import org.dspace.authorize.AuthorizeManager;
                           ^
IRRStatsController.java:6: error: package org.dspace.content does not exist
import org.dspace.content.*;
^
IRRStatsController.java:7: error: package org.dspace.core does not exist
import org.dspace.core.Constants;
                      ^
IRRStatsController.java:8: error: package org.dspace.core does not exist

Thanks a lot,
Gary
