@seanupton
Last active October 24, 2019 15:09
Slide notes / ingest video script

Slide 3: Newspaper work types and relationships

NewspaperWorks provides five new work types to a Hyrax application:

  • Newspaper titles represent a publication, and contain Issues and Containers.
  • Issues can contain Pages and Articles.
  • Articles, if you have segmented content, can be associated with one or more pages.
  • Optionally, Pages can relate to Containers in the same publication for respective microform reels.
  • You can read more about these types on the NewspaperWorks repository wiki in samvera-labs.
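
As a rough illustration of the relationships listed above, here is a minimal sketch using plain Ruby structs. These are stand-ins invented for this note, not the actual NewspaperWorks model classes, and the sample title, headline, and reel label are made up.

```ruby
# Plain-Ruby stand-ins for the five work types (hypothetical, for illustration only).
Title     = Struct.new(:name, :issues, :containers)
Issue     = Struct.new(:date, :pages, :articles)
Page      = Struct.new(:number, :container)
Article   = Struct.new(:headline, :pages)
Container = Struct.new(:label)

# A microform reel that the pages of this publication can point back to.
reel = Container.new("Reel 1")

# Pages optionally relate to a Container in the same publication.
page1 = Page.new(1, reel)
page2 = Page.new(2, reel)

# An Article can be associated with one or more pages.
article = Article.new("City council meets", [page1, page2])

# An Issue contains Pages and Articles; a Title contains Issues and Containers.
issue = Issue.new("1910-05-02", [page1, page2], [article])
title = Title.new("Example Gazette", [issue], [reel])

puts title.issues.first.pages.map(&:number).inspect
puts article.pages.length
```

The same page objects appear both in the issue and in the article, which mirrors how an article is linked to existing pages rather than owning copies of them.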

[NEXT SLIDE PLEASE]

Slide 4: Newspaper metadata fields

NewspaperWorks has a rich set of descriptive metadata fields for these work types. This includes support for a place of publication field integrated with autocomplete from GeoNames, and a variety of other fields including edition name and number, volume number, issue number, section name, and page number.

Our ingest process automatically looks up and populates key publication metadata values from authorities at the Library of Congress, including the LCCN Permalink Service and Chronicling America.

[NEXT SLIDE PLEASE]

Slide 5: Batch Ingest Workflows

We support batch ingests for three common batch formats:

  • Batches conforming to the National Digital Newspaper Program (or NDNP) specifications from the Library of Congress.
  • PDF issues, either uploaded through the browser or ingested in quantity from a command-line rake task.
  • TIFF or JP2 page images, also by rake task.

While the NDNP batches provide explicit batch manifests, our other ingests rely upon a simple file and directory naming convention to ensure that metadata and page order needs are met for ingested materials.
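
To make the idea of a naming convention concrete, here is a minimal sketch of how date, page order, and LCCN could be read from paths alone. The layout `<lccn>/<YYYY-MM-DD>/<page file>` is an assumption chosen for illustration and may differ from the convention NewspaperWorks actually expects; the LCCN value and the helper names `parse_page_path` and `group_pages` are made up.

```ruby
require "date"

# Hypothetical layout (an assumption, not the documented convention):
#   <lccn>/<YYYY-MM-DD>/<page file>
PagePath = Struct.new(:lccn, :date, :page_file)

# Read the LCCN, issue date, and page file name from the last three path segments.
def parse_page_path(path)
  lccn, date_dir, page_file = path.split("/").last(3)
  PagePath.new(lccn, Date.iso8601(date_dir), page_file)
end

# Group pages into issues keyed by [lccn, date], sorting page files lexically.
def group_pages(paths)
  paths
    .map { |path| parse_page_path(path) }
    .group_by { |page| [page.lccn, page.date] }
    .transform_values { |pages| pages.map(&:page_file).sort }
end

paths = [
  "batch/sn12345678/1910-05-02/0002.tif",
  "batch/sn12345678/1910-05-02/0001.tif",
  "batch/sn12345678/1910-05-03/0001.tif"
]

issues = group_pages(paths)
issues.each do |(lccn, date), pages|
  puts "#{lccn} #{date}: #{pages.join(', ')}"
end
```

Zero-padded file names matter here: a plain lexical sort only matches page order when `0002.tif` sorts before `0010.tif`.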

[NEXT SLIDE PLEASE]

Slide 6: NDNP ingest

"Let's ingest an NDNP batch..."

In a terminal window, from the root directory of the application integrating NewspaperWorks, we run a rake task to ingest a batch conforming to NDNP specifications, containing multiple issues from multiple publications.

Many institutions have already created batches of newspaper assets in this format, and many sample batches are freely available online.

This makes NDNP batches a good way to get started using or evaluating NewspaperWorks with minimal effort.

[NEXT SLIDE PLEASE]

Slide 7: TIFF Ingest

"NewspaperWorks has batch ingest support for both page images and PDF issues..."

"We have a rake task to ingest a batch of these simple files. If the file and directory naming gives us the date, the lexical ordering of pages, and the LCCN of the publication, we can populate usable content from those image assets alone.

Here, there's a rake task ingesting TIFF files that are organized in directories and files named to provide that simple information."

[NEXT SLIDE PLEASE]

Slide 8: PDF issue upload

Ingesting a single Newspaper Issue is as easy as uploading a multi-page PDF file for that issue.

...uploading this in a browser, or as part of a command-line batch, automatically creates child Newspaper Page works, complete with derivatives and OCR for text indexing and word coordinates of each page.

The result, after all ingest jobs are complete, is an issue containing pages you can navigate through using next, previous, and breadcrumb navigation.

You can also ingest multiple PDF issues for a single publication easily with the same rake task demonstrated for TIFF files in the previous slide.

[NEXT SLIDE PLEASE]

Slide 9: Batch ingest actions

"Behind the scenes, there is a lot going on to make sure that batch-ingested content is complete and usable.

Publications for each issue are found, or created as needed, and the entire hierarchy of pages, issues, and publication title works are created in the repository.

Available descriptive metadata is set, and a rich set of derivatives is created by a customizable series of derivative creation plugins. In the case of NDNP assets, we also ingest the pre-made derivatives, which avoids the cost of creating them again.

Finally, default administrative metadata is set, but if you wish to use a different depositor, admin set, or visibility for the works you ingest, this can be specified."
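
The find-or-create and default-override behavior described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the NewspaperWorks implementation: the registry is an in-memory hash standing in for the repository, and the default values, the LCCN, and the names `PublicationRegistry` and `ingest_issue` are all hypothetical.

```ruby
# Hypothetical defaults; the values NewspaperWorks actually applies are not stated in these notes.
DEFAULT_ADMIN_METADATA = {
  depositor: "batch_user@example.org",
  admin_set: "default",
  visibility: "open"
}.freeze

# In-memory stand-in for the repository, keyed by LCCN.
class PublicationRegistry
  def initialize
    @by_lccn = {}
  end

  # Return the existing publication for this LCCN, or create it on first use.
  def find_or_create(lccn)
    @by_lccn[lccn] ||= { lccn: lccn, issues: [] }
  end

  def size
    @by_lccn.size
  end
end

# Attach an issue to its publication; any overrides replace the defaults.
def ingest_issue(registry, lccn:, date:, pages:, **overrides)
  publication = registry.find_or_create(lccn)
  issue = { date: date, pages: pages, admin: DEFAULT_ADMIN_METADATA.merge(overrides) }
  publication[:issues] << issue
  issue
end

registry = PublicationRegistry.new
first = ingest_issue(registry, lccn: "sn12345678", date: "1910-05-02",
                     pages: %w[0001.tif 0002.tif])
second = ingest_issue(registry, lccn: "sn12345678", date: "1910-05-03",
                      pages: %w[0001.tif], visibility: "restricted")

puts registry.size
puts first[:admin][:visibility]
puts second[:admin][:visibility]
```

Both issues share one publication record because the second call finds the entry the first call created, and only the second issue departs from the default visibility.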
