Last active October 8, 2018 16:52
SBB API docs

APIs of the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz*

*(to the extent currently implemented)

Programmatic access to the digitised collections and digitised newspapers of the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz (SBB) is currently possible via two distinct APIs.

Metadata for objects in the digitised collections can be retrieved via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard. A wide range of client applications for OAI-PMH in numerous programming languages is freely available on the web.

The base URL for the OAI-PMH endpoint of the digitised collections of the SBB is

Using the six verbs provided by OAI-PMH, requests such as the following can be made:

  • "Which metadata formats are supported by the API?"

  • "Which digital collections exist?"
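
The two questions above map onto the OAI-PMH verbs `ListMetadataFormats` and `ListSets`. The following is a minimal sketch of how such request URLs could be assembled. The base URL is a placeholder (the real SBB endpoint has to be substituted), and `build_oai_url` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the real OAI-PMH base URL of the SBB.
BASE_URL = "https://example.org/oai"

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_url(verb, base_url=BASE_URL, **params):
    """Return a request URL for the given OAI-PMH verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **params}
    return f"{base_url}?{urlencode(query)}"


# "Which metadata formats are supported by the API?"
formats_url = build_oai_url("ListMetadataFormats")
# "Which digital collections exist?"
sets_url = build_oai_url("ListSets")

print(formats_url)
print(sets_url)
```

The resulting URLs can be fetched with any HTTP client; the response is an XML document as defined by the OAI-PMH specification.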

The SBB implements Dublin Core (DC) for basic bibliographic metadata and METS for all metadata about the contents and structure of a digital object.

By combining OAI-PMH verbs with the DC metadata, more specific requests can be formulated, such as

  • "Which digitised newspapers exist?"

The response contains a unique identifier for each digital object, the PPN. Using the PPN, additional information about a digital object can be retrieved.

By changing the metadataPrefix to mets, the complete METS metadata record containing all references to any related files (images, OCR) can be retrieved.
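
As a rough sketch of this step, the snippet below pulls the identifiers out of an OAI-PMH response and builds `GetRecord` query strings for both metadata formats. The sample response and its identifier are invented placeholders, not real SBB data. `oai_dc` is the Dublin Core prefix defined by the OAI-PMH standard; that the SBB endpoint exposes it under exactly this name is an assumption. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response; the identifier is a placeholder, not a real PPN.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:PPN000000000</identifier>
      <datestamp>2018-01-01</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def extract_identifiers(xml_text):
    """Return the text of every <identifier> found in a record header."""
    root = ET.fromstring(xml_text.strip())
    path = ".//oai:header/oai:identifier"
    return [el.text for el in root.iterfind(path, OAI_NS)]


def get_record_query(identifier, metadata_prefix):
    """Return the query string of a GetRecord request."""
    return urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })


identifiers = extract_identifiers(SAMPLE_RESPONSE)
dc_query = get_record_query(identifiers[0], "oai_dc")  # basic DC record
mets_query = get_record_query(identifiers[0], "mets")  # full METS record

print(dc_query)
print(mets_query)
```

Appending either query string to the endpoint's base URL after a `?` yields the full request.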

The METS file contains a <fileSec> section whose <fileGrp> child elements hold references to the various files that belong to the digital object, typically images in either JPG or PNG format...

...and OCRed text files in ALTO format
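
The file references described above can be collected with a few lines of standard-library Python. This is only a sketch under stated assumptions: the sample METS document, its `USE` values and its URLs are invented for illustration, and a real SBB record may group or name its files differently. `list_file_references` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Invented sample; USE values and URLs are placeholders.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="FILE_0001_DEFAULT" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="https://example.org/images/00000001.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="FILE_0001_FULLTEXT" MIMETYPE="text/xml">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="https://example.org/ocr/00000001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_file_references(mets_xml):
    """Map each fileGrp's USE attribute to the list of file URLs it holds."""
    root = ET.fromstring(mets_xml.strip())
    href_attr = f"{{{NS['xlink']}}}href"  # "{namespace}href" in ElementTree
    references = {}
    for group in root.iterfind("mets:fileSec/mets:fileGrp", NS):
        hrefs = []
        for flocat in group.iterfind("mets:file/mets:FLocat", NS):
            href = flocat.get(href_attr)
            if href:
                hrefs.append(href)
        references[group.get("USE", "")] = hrefs
    return references


file_refs = list_file_references(SAMPLE_METS)
for use, hrefs in file_refs.items():
    print(use, hrefs)
```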

1a. Digitised collections: Other goodies

Retrieval of content (images and full text) for digitised newspapers is supported via the International Image Interoperability Framework (IIIF) protocol. Here, too, a growing number of free clients and libraries for IIIF in numerous programming languages is available on the web.

Currently, digitised newspaper images and metadata can be retrieved by requests following this schema: {ZDB-ID}-{YYYYMMDD}-{Issue}-{Page}-{Article}-{Version}

The ZDB-ID is a unique identifier for every newspaper title; it can be found either in the ZEFYS newspaper portal or directly in the ZDB.

Next, a date of issue needs to be specified in the YYYYMMDD format, e.g. 18900101 for the issue published on January 1st, 1890. Information about which date ranges have already been digitised for each newspaper title can again be found in the ZEFYS newspaper portal.
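
A small sketch of how the identifier schema could be filled in is given below. The ZDB-ID and the component values are invented placeholders, and whether issue, page, article and version need any zero padding is not stated here, so the hypothetical helper takes them as ready-made strings.

```python
from datetime import date


def build_newspaper_id(zdb_id, issue_date, issue, page, article, version):
    """Join the parts as {ZDB-ID}-{YYYYMMDD}-{Issue}-{Page}-{Article}-{Version}."""
    # Format the date by hand so that years before 1900 work on every platform.
    yyyymmdd = f"{issue_date.year:04d}{issue_date.month:02d}{issue_date.day:02d}"
    return "-".join([zdb_id, yyyymmdd, issue, page, article, version])


# Placeholder ZDB-ID; the remaining component values are assumptions.
newspaper_id = build_newspaper_id(
    zdb_id="00000000",
    issue_date=date(1890, 1, 1),
    issue="0",
    page="0",
    article="0",
    version="0",
)
print(newspaper_id)
```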

By then combining the page number 0 with the ending .xml in the URL, the METS metadata document for each newspaper title can be obtained.

[Please note that the functionality described here for retrieving OCR data is not implemented yet!]
By incrementing the page number, the OCRed text files in ALTO format can be requested for page 1, page 2, and so forth.

To retrieve the scanned images for the newspaper, further information needs to be appended to the URL, namely /full/{width in pixels},/0/default.jpg, where the supported widths in pixels are 1200, 800 and 250.

It is also possible to retrieve the original TIFF images via IIIF by replacing the width in pixels with full and specifying default.tif instead of default.jpg in the URL.
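
The URL endings for the two image variants can be sketched as follows. The helper names are hypothetical. The TIFF variant assumes that the whole size segment, including the trailing comma, is replaced by `full`, as in the size syntax of the IIIF Image API. The endings are meant to be appended to the URL of a newspaper page.

```python
# Widths in pixels named as supported for JPEG delivery.
SUPPORTED_WIDTHS = (1200, 800, 250)


def jpeg_suffix(width):
    """Return the URL ending for a JPEG scaled to one of the supported widths."""
    if width not in SUPPORTED_WIDTHS:
        raise ValueError(f"width must be one of {SUPPORTED_WIDTHS}, got {width}")
    return f"/full/{width},/0/default.jpg"


def tiff_suffix():
    """Return the URL ending for the original TIFF (assumed size segment 'full')."""
    return "/full/full/0/default.tif"


for width in SUPPORTED_WIDTHS:
    print(jpeg_suffix(width))
print(tiff_suffix())
```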

Some working examples:

  • TIF page 1

  • JPG page 1

  • JPG page 1 (article 10 highlighted)

  • PDF page 1

  • PDF all pages

  • ALTO page 1

  • METS
