
@nichtich
Created February 17, 2011 16:32
Make use of VIAF authority records
#!/usr/bin/perl
=head1 NAME
viaflookup.pl - How to make use of VIAF authority records
=head1 VERSION
Version 0.2 - 2011-02-18
=cut
use strict;
use LWP::Simple;
use Data::Dumper;
use CGI qw(escape param header);
use JSON;
use Carp;
=head1 DESCRIPTION
The L<Virtual International Authority File|http://www.viaf.org> (VIAF)
combines authority files from more than a dozen libraries and countries.
This script implements and describes the use of the VIAF API. See also the
L<developer documentation|http://www.oclc.org/developer/services/viaf>
provided by OCLC. To make the best use of VIAF, you should be familiar with
its basic concepts: authority schemes, authority agencies, authority
records, and identifiers. Some L<basic notes about RDF|/RDF> are given below.
=head2 Schemes
In VIAF B<authority files> are called "authority schemes". An authority
file or scheme is a collection of controlled L<authority records|/Records>
that uniquely identify a person, institution, or another concept. Some also
call this a "Knowledge Organization System". Authority files have been used
in libraries for more than a century. However, these schemes have rarely
been connected to one another. This is where VIAF comes into play.
VIAF connects records from several authority schemes. In VIAF each scheme
is identified by a scheme identifier, for instance C<DNB> for the German
"Gemeinsame Normdatei" (GND). Most identifiers consist of uppercase letters,
but they also occur in lowercase in some places, and some identifiers mix
uppercase and lowercase, so there seems to be no strictly canonical form.
VIAF defines a URI for each scheme, based on its identifier, for instance
L<http://viaf.org/authorityScheme/DNB>. The scheme identifier is also used
to identify L<agencies|/Agencies>. This mixing of concepts can be seen as
a bug or a feature, but you need to take care when using VIAF.
For each scheme there is a small icon that can be accessed by its identifier,
for instance L<http://viaf.org/viaf/images/flags/NLIheb.png>.
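As a rough sketch of the two URL patterns mentioned above, the following
snippet builds the scheme URI and the icon URL from a scheme identifier.
The helper names C<scheme_uri> and C<scheme_icon> are made up for this
illustration, and the patterns are inferred only from the two examples
given in this section.

```perl
use strict;
use warnings;

# Hypothetical helpers; the URL patterns are inferred from the examples
# http://viaf.org/authorityScheme/DNB and .../images/flags/NLIheb.png.
sub scheme_uri {
    my ($scheme) = @_;
    return "http://viaf.org/authorityScheme/$scheme";
}

sub scheme_icon {
    my ($scheme) = @_;
    return "http://viaf.org/viaf/images/flags/$scheme.png";
}

print scheme_uri('DNB'),     "\n";
print scheme_icon('NLIheb'), "\n";
```

Because the case of scheme identifiers is not strictly fixed, the
identifier is passed through unchanged instead of being uppercased.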
=head2 Agencies
An agency is some organization that publishes at least one authority scheme.
Most agencies in VIAF provide only one scheme, so they are simply identified
by their scheme's identifier, for instance C<DNB> for the German National
Library. Institutions that provide more than one scheme have a different
identifier, for instance C<NLI> for the National Library of Israel. It is
assumed that VIAF defines a URI for each agency, for instance
L<http://viaf.org/authorityAgency/DNB>, but these URIs and the connection
between agencies and their schemes are not published in RDF yet.
A list of participating institutions in VIAF can be found at the VIAF
homepage L<http://viaf.org/> in HTML. For each agency there is an icon
as well, for instance L<http://viaf.org/viaf/images/flags/RERO.png>.
=head2 Records
Authority records are identified by their scheme and a local identifier
within this scheme. VIAF combines both parts to form one identifier, but
there are several forms:
=over
=item Simple string form
Scheme identifier and local identifier, combined with a vertical bar.
For instance C<LC|n 50034593> identifies a Library of Congress name
authority record.
=item Processed string form
Scheme identifier and normalized local identifier, combined with a
vertical bar. For instance C<LC|n 50034593>. Normalization rules
depend on the particular authority scheme.
=item URI forms
The simple string form is used to define several URIs. However, these
URIs are not suitable as permanent linked data URIs because of problems
with the encoding of characters that don't belong in URIs and because of
missing content negotiation. For each record there are at least the
following URIs (given as an example for the record C<LC|n 50034593>):
=over
=item L<http://viaf.org/processed/LC%7Cn%2050034593>
A representation of the source record that VIAF used for mapping
(in MARCXML format).
=item L<http://viaf.org/processed/LC%7Cn%2050034593#skos:Concept>
The authority record.
=item L<http://viaf.org/viaf/sourceID/LC%7Cn%2050034593>
An HTTP 302 redirect to the mapped VIAF record.
=back
Most authority files should define their own clean, strict, and
resolvable URIs for authority records. If there is such a URI,
you should be able to construct it from the local identifier.
Depending on the identifier structure, the institution may need to
define some normalization, for instance as described here for LCCN:
L<http://www.loc.gov/marc/lccn-namespace.html#normalization>.
For instance L<http://d-nb.info/gnd/118540475> is a much better
URI than L<http://viaf.org/processed/DNB%7C118540475#skos:Concept>.
=back
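The LCCN normalization referenced above can be sketched roughly as follows.
This is one reading of the rules on the linked page (remove blanks, drop a
forward slash and everything after it, remove the hyphen and left-pad the
part after it to six digits), not an official implementation. The function
name C<normalize_lccn> is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper implementing the three normalization steps.
sub normalize_lccn {
    my ($lccn) = @_;
    $lccn =~ s/\s+//g;     # 1. remove all blanks
    $lccn =~ s{/.*}{}s;    # 2. drop a forward slash and everything after it
    if ( $lccn =~ /^([^-]*)-(\d{1,6})$/ ) {
        # 3. remove the hyphen and left-pad the serial part to six digits
        $lccn = $1 . sprintf( '%06d', $2 );
    }
    return $lccn;
}

for my $raw ( 'n 50034593', 'n78-89035', '85-2', '75-425165//r75' ) {
    printf "%-18s => %s\n", "'$raw'", normalize_lccn($raw);
}
```

The normalized value is what a pattern such as C<info:lccn/$1> would
expect as its local part.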
VIAF records are just one special kind of authority record: they
contain mappings to other authority records. You can get VIAF records
in different formats (VIAF-XML, MARCXML, UnimarcXML, RDF, JSON).
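To make the identifier forms from this section concrete, here is a small
sketch that derives the simple string form and the three URIs listed above
from a scheme and a local identifier. The helper names C<percent_encode>
and C<record_forms> are assumptions for this example. The URL patterns are
the ones shown in the list, nothing more.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes ASCII input, which holds for the identifiers used here.
sub percent_encode {
    my ($string) = @_;
    $string =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $string;
}

# Hypothetical helper: identifier forms and URIs for one authority record.
sub record_forms {
    my ( $scheme, $local ) = @_;
    my $id      = "$scheme|$local";    # simple string form
    my $encoded = percent_encode($id);
    return {
        id        => $id,
        processed => "http://viaf.org/processed/$encoded",
        concept   => "http://viaf.org/processed/$encoded#skos:Concept",
        redirect  => "http://viaf.org/viaf/sourceID/$encoded",
    };
}

my $forms = record_forms( 'LC', 'n 50034593' );
print "$_: $forms->{$_}\n" for sort keys %$forms;
```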
=cut
# agencies and schemes
my $schemes = {
BAV => {
name => 'Vatican Library',
},
BNE => {
name => 'Biblioteca Nacional de España',
},
BNF => {
name => 'Bibliothèque nationale de France',
records => 'http://catalogue.bnf.fr/ark:/12148/cb$1'
},
DNB => {
name => 'Gemeinsame Normdatei',
records => 'http://d-nb.info/gnd/$1'
},
EGAXA => {
name => 'Bibliotheca Alexandrina',
},
ICCU => {
name => 'Italian National Catalog',
},
JPG => {
name => 'Getty ULAN',
},
JPGRI => {
name => 'Getty Research Institute',
},
LAC => {
name => 'Library and Archives Canada',
},
LC => {
name => 'Library of Congress Authorities',
short => 'LOC',
# see http://www.loc.gov/marc/lccn-namespace.html#normalization
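filter => sub {
my $lccn = @_ ? shift : $_; # accept an argument, fall back to $_
$lccn =~ s/ |\/.*//g; # remove all blanks and characters after forward slash
if ( $lccn =~ /^([^-]+)-(.*)$/ ) {
# always remove the hyphen; left-pad the part after it to six characters
my ($prefix, $serial) = ($1, $2);
$serial = ('0' x (6 - length $serial)) . $serial if length($serial) < 6;
return $prefix . $serial;
}
return $lccn;
},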
pattern => qr/^([a-z]*\d+)$/,
records => 'info:lccn/$1'
},
NKC => {
name => 'National Library of the Czech Republic',
},
NLA => {
name => 'National Library of Australia',
},
NLI => {
name => 'National Library of Israel',
},
NLIara => {
name => 'National Library of Israel',
},
NLIcyr => {
name => 'National Library of Israel',
},
NLIheb => {
name => 'National Library of Israel',
},
NLIlat => {
name => 'National Library of Israel',
},
NSZL => {
name => 'National Széchényi Library (Hungary)'
},
NUKAT => {
name => 'NUKAT, Poland'
},
PTBNP => {
name => 'Biblioteca Nacional de Portugal',
},
RERO => {
name => 'RERO (Switzerland)'
},
SELIBR => {
name => 'National Library of Sweden',
records => 'http://libris.kb.se/auth/$1'
},
SWNL => {
name => 'Swiss National Library',
},
VIAF => {
name => 'Virtual International Authority File',
uri => 'http://viaf.org/viaf/$1/',
},
};
=head2 Making use of VIAF
VIAF provides a large amount of information. Some typical queries are:
=over
=item Find authority records for a person
Given a name, you want to know whether authority records exist and
which ones, so you can create links to an authority. Linking to authorities
is best practice in cataloging, so this is an important query.
In VIAF you can search by name either via SRU or via a simple REST
API. If you only need to find authority records, the latter is the
better choice. Here is an example query:
L<http://viaf.org/viaf/AutoSuggest?query=Emma%20Goldman>
The result is a JSON document that echoes the normalized C<query> and
gives a (possibly empty) ordered list of VIAF records as C<result>. Each
VIAF record contains the full name of a person as C<term> and local
authority record identifiers, keyed by the scheme identifier in lowercase.
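The structure described above can be processed roughly like this. The JSON
document in the sketch is an illustrative stand-in that only follows the
description (C<query>, C<result>, C<term>, lowercase scheme keys). It is
not copied from a real VIAF response, and real responses may contain
additional keys.

```perl
use strict;
use warnings;
use JSON::PP;

# Illustrative response only: the shape follows the description above,
# the content is not a recorded VIAF response.
my $body = <<'JSON';
{
  "query": "emma goldman",
  "result": [
    { "term": "Goldman, Emma", "lc": "n50034593" }
  ]
}
JSON

my $json = decode_json($body);
my @lines;
for my $record ( @{ $json->{result} || [] } ) {    # result may be empty
    # uc() is a simplification: mixed-case scheme identifiers such as
    # NLIheb would need a lookup table instead.
    my @ids = map  { uc($_) . '|' . $record->{$_} }
              grep { $_ ne 'term' } sort keys %$record;
    push @lines, join( ';', @ids ) . ' = ' . $record->{term};
}
print "$_\n" for @lines;
```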
=cut
use LWP::UserAgent;
use HTTP::Request::Common;
my $ua = LWP::UserAgent->new;
# my $format = param('format'); # TODO: seealso, rdf, etc.
my $search = param('search') || "";
$search =~ s/[\r\n]+//g; # strip all line breaks
my $suggest = 0; #param('suggest'); # TODO
my $id = 0; #param('id');
print header('text/plain; charset=UTF-8');
binmode *STDOUT, ":utf8";
if ($search) {
my @clusters = searchName( $search );
# print "$search\n";
foreach (@clusters) {
print $_->condensed . "\n";
}
} elsif ($suggest) {
# search for name
my $url = 'http://viaf.org/viaf/AutoSuggest?query=' . escape($suggest);
# print "URL:$url\n";
my $body = get($url);
my $json = defined $body ? eval { decode_json($body) } : undef; # undef on failed request or invalid JSON
if ( $json && $json->{result} ) {
#print Dumper($json);
foreach (@{$json->{result}}) {
handle_record ($_);
}
}
} elsif($id) {
if ($id =~ /^([A-Za-z]+)$/) {
# TODO: get information about a scheme or agency
} elsif ($id =~ /^(VIAF\|)?(\d+)$/) {
my $url = "http://viaf.org/viaf/$2"; # use the numeric part only, without an optional "VIAF|" prefix
# TODO
} elsif($id =~ /^([A-Za-z]+)[|:](.+)$/ and $schemes->{$1}) {
my $url = 'http://viaf.org/viaf/sourceID/'.escape("$1|$2");
# GET from HTTP::Request::Common takes headers as a flat key/value list
my $response = $ua->request( GET $url, 'Accept' => 'application/rdf+xml' );
# http://viaf.org/viaf/sourceID/LC%7Cn%2050034593
} else {
#print STDERR "Unknown id format\n";
}
}
sub handle_record { # FIXME
my $r = shift;
my @keys;
foreach my $prefix (keys %$r) {
next if $prefix eq 'term';
my $local = $r->{$prefix};
$prefix = uc($prefix);
print "$prefix|$local";
if ( $schemes->{$prefix} && $schemes->{$prefix}->{records} ) {
my $uri = $schemes->{$prefix}->{records};
my $pattern = $schemes->{$prefix}->{pattern} || qr/^(\d+)$/;
if ($local =~ $pattern) {
my ($first, $second) = ($1, $2); # TODO: $3, $4, ... (avoid $a/$b, which sort() uses)
$uri =~ s/\$1/$first/;
$uri =~ s/\$2/$second/ if defined $second;
print " = $uri";
}
}
print "\n";
}
print "\n";
}
=head2 searchName
Search for a name in VIAF. Internally this method performs an SRU Query.
Returns a (possibly empty) list of up to 10 L<VIAF::Cluster> records.
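The SRU request that C<searchName> sends can be assembled as in the
following sketch. It mirrors the URL built in the code below but does not
perform any request. The helper names C<percent_encode> and
C<sru_search_url> are made up for this example, and the small encoder stands
in for C<CGI::escape> so that the snippet has no dependencies outside core
Perl.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
sub percent_encode {
    my ($string) = @_;
    utf8::encode($string) if utf8::is_utf8($string);    # encode as UTF-8 bytes
    $string =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $string;
}

# Hypothetical helper: build the SRU searchRetrieve URL for a personal name.
sub sru_search_url {
    my ( $name, $max ) = @_;
    $max ||= 10;
    $name =~ s/['"\\]//g;    # strip characters that would break the quoted term
    my $cql = qq{local.personalNames all "$name"};
    return 'http://viaf.org/viaf/search?version=1.1&operation=searchRetrieve'
        . "&maximumRecords=$max&httpAccept=text/xml"
        . '&query=' . percent_encode($cql);
}

print sru_search_url('Emma Goldman'), "\n";
```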
=cut
sub searchName {
my $name = shift;
$name =~ s/['"\\]//g;
# retrieve response in VIAF-XML. Alternatively we could use RDF/XML
my $url = "http://viaf.org/viaf/search?version=1.1&operation=searchRetrieve"
. "&maximumRecords=10&httpAccept=text/xml"
. "&query=" . escape("local.personalNames all \"$name\"");
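# "use" is resolved at compile time, so load the module with require instead
eval { require XML::XPath; 1 }
or croak "Missing XML::XPath module to parse SRU response";
my $xml = get($url);
return unless defined $xml; # request failed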
#my $fh; open ($fh, "<", "viaf.xml");
#my $xml = join("\n",<$fh>);
my $xpath = XML::XPath->new( xml => $xml );
$xpath->set_namespace('v','http://viaf.org/viaf/terms#');
my @clusters;
foreach my $cluster ( $xpath->findnodes('//v:VIAFCluster[v:nameType="Personal"]') ) {
my $type = $xpath->findvalue('v:nameType', $cluster);
my $id = $xpath->findvalue('.//v:viafID', $cluster);
my $term = $xpath->findvalue('(v:mainHeadings//v:text)[1]', $cluster);
my $c = VIAF::Cluster->new( viaf => $id, term => $term );
# TODO: extract link to WorldCat Identities, Wikipedia, and DBPedia...
foreach my $source ( $xpath->findnodes( './/v:source', $cluster ) ) {
my $id = $source->string_value();
if ( $id =~ /^([A-Za-z]+)\|(.+)$/ ) {
$c->add($1,$2);
}
}
push @clusters, $c;
}
return @clusters;
}
# Example:
# http://viaf.org/viaf/39377930/
# http://www.worldcat.org/wcidentities/lccn-n50-34593
# http://wikipedia.org/wiki/Emma_Goldman
# http://dbpedia.org/resource/Emma_Goldman
#package VIAF;
package VIAF::Cluster;
use Scalar::Util qw(refaddr);
sub new {
my $class = shift;
my $self = bless { @_ }, $class;
return $self;
}
sub add {
my ($self,$prefix,$id) = @_;
unless ( $schemes->{$prefix} ) {
foreach my $key ( keys %$schemes ) {
next unless lc($key) eq lc($prefix);
$prefix = $key;
last;
}
}
my $scheme = $schemes->{$prefix} || return;
# TODO: normalize id
$self->{$prefix} = $id;
}
sub uri {
my $self = shift;
return "http://viaf.org/viaf/" . $self->{viaf} if $self->{viaf};
}
sub bnode {
my $self = shift;
return refaddr($self);
}
sub uri_nt {
my $self = shift;
return $self->{viaf} ? "<".$self->uri.">" : "_:b".$self->bnode;
}
sub condensed {
my $self = shift;
my $string = join( ";",
# keep mixed-case scheme identifiers such as NLIheb intact
map { ( $_ eq 'viaf' ? 'VIAF' : $_ ) . "|" . $self->{$_} }
grep { $_ ne 'term' } sort keys %$self
);
$string .= " = " . $self->{term} if $self->{term};
return $string;
}
1;
=head2 NOTES
=head3 RDF
VIAF and authority files do not depend on RDF, but RDF is a good technology
to make use of authority data. The basic ontology for authority schemes is
the L<Simple Knowledge Organization System|http://www.w3.org/2004/02/skos/>
(SKOS). The core concepts of VIAF are mapped to the following parts of SKOS:
=over
=item Schemes
L<skos:ConceptScheme|http://www.w3.org/2004/02/skos/core#ConceptScheme>.
=item Agencies
...
=item Records
L<skos:Concept|http://www.w3.org/2004/02/skos/core#Concept>.
=back
The RDF representation of VIAF data uses SKOS with the namespace
L<http://www.w3.org/2004/02/skos/core#>. This is the namespace of the SKOS
W3C Recommendation; the namespace L<http://www.w3.org/2008/05/skos#> was
only used in an interim working draft, so no replacement is needed.
In addition to SKOS, VIAF defines its own ontology that is located at
L<http://viaf.org/>. To access RDF data from this and other URIs, you need
to send an HTTP request with a special C<Accept> header to tell the server
that you want RDF data rather than an HTML page. Most RDF tools do this for
you. I recommend the command line tool C<rapper>.
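A request with such an C<Accept> header might look like the following
sketch. It uses the core module C<HTTP::Tiny> instead of the LWP classes
used elsewhere in this script, purely to keep the example free of
dependencies. The helper names C<rdf_headers> and C<fetch_rdf> are made up
for this example, and whether a given server honors the header is not
guaranteed.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Headers that ask a server for RDF/XML instead of an HTML page.
sub rdf_headers {
    return { Accept => 'application/rdf+xml' };
}

# Fetch a URI with content negotiation; returns the body or undef on failure.
sub fetch_rdf {
    my ($uri) = @_;
    my $response = HTTP::Tiny->new->get( $uri, { headers => rdf_headers() } );
    return $response->{success} ? $response->{content} : undef;
}

# Only touch the network when a URI is given on the command line.
if (@ARGV) {
    my $rdf = fetch_rdf( $ARGV[0] );
    print defined $rdf ? $rdf : "Request failed\n";
}
```

Called with a URI as its only argument, the snippet prints whatever the
server returns for that media type.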
=head1 AUTHOR
Jakob Voss C<< <jakob.voss@gbv.de> >>
=head1 LICENSE
Copyright (C) 2011 by Verbundzentrale Goettingen (VZG) and Jakob Voss
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, either Perl version 5.8.8 or, at
your option, any later version of Perl 5 you may have available.

spchamp commented Jun 26, 2014

The VIAF developer documentation has been moved, it seems - now available at
http://www.oclc.org/developer/develop/web-services/virtual-international-authority-file-viaf.en.html

The OCLC has also published a nice developer handbook, available at:
http://www.oclc.org/developer/develop/web-services.en.html

The developer handbook does not mention VIAF, but it may still be useful for working with the OCLC web APIs.

VIAF in the OCLC API Explorer:
https://platform.worldcat.org/api-explorer/VIAF

It seems that the "Jane Austen" resource ID from the example in the API explorer has been updated.

Regarding the SRU syntax used in the API SRUSearch function:
http://www.loc.gov/standards/sru/

Raw VIAF data, in RDF, MARC-21, and plain text formats:
http://viaf.org/viaf/data/
