@jeremyf
Created April 5, 2024 14:34

Table of Contents

  1. 🐮🤠 Welcome to the (Derivative) Rodeo 🤠🐮
    1. Overview
    2. Problem Statement
    3. History
      1. Hydra::Derivatives
      2. Hyrax::DerivativeService
        1. Hyrax::FileSetDerivativesService
      3. NewspaperWorks
        1. Extending/Overriding NewspaperWorks
      4. IiifPrint
      5. Derivative::Rodeo DerivativeRodeo
        1. Why Make More Gems?
    4. There’s More Than One Way to Rodeo
      1. The Derivative Two-Step
      2. Foundations of the Rodeo
    5. Wrap Up
      1. Where Can I find the DerivativeRodeo?
      2. Why the derivative-rodeo and derivative_rodeo?
      3. What’s In A Name
      4. What’s Next
      5. 🐮🤠 Happy Trails 🤠🐮

🐮🤠 Welcome to the (Derivative) Rodeo 🤠🐮

Presented on Wednesday, 2023-05-03 at Samvera Virtual Connect 2023

Overview

“This ain’t my first rodeo.”1

In this talk I’ll go over:

  • The Problem Statement
  • History
  • The Rodeo
  • Wrap Up
    • What?
    • Where?
    • Why?

Problem Statement

Given that I have a million billion objects
When I ingest those objects
Then things are really slow

Also

Given that I have a million billion objects
And I already have a mixture of derivatives
When I ingest those objects
Then I really don’t want to recreate things I already have

History

Time why you punish me
Like a wave bashing into the shore
You wash away my dreams

Time why you walk away
Like a friend with somewhere to go
You left me crying

Can you teach me 'bout tomorrow
And all the pain and sorrow running free
'Cause tomorrow's just another day
And I don't believe in time

The Gems:

  • Hydra::Derivatives
  • Hyrax::DerivativeService
  • NewspaperWorks
    • Extending/Overriding NewspaperWorks
  • IiifPrint
  • DerivativeRodeo

Hydra::Derivatives

Said in the voice of Sophia from Golden Girls:

“Picture it: Minneapolis, 2013. A younger Justin Coyne creates a repository.”

Hydra::Derivatives is a venerable, long-used gem for generating derivatives in the Samvera community. It is highly configurable and extensible.

Hyrax::DerivativeService

The Hyrax::DerivativeService implements the interface for generating derivatives for a FileSet. It uses the registered services to find the first valid one and then uses that to create the derivatives.

# @api public
#
# Get the first valid registered service for the given file_set.
#
# @param file_set [#uri, #file_set]
# @return [#cleanup_derivatives, #create_derivatives, #derivative_url]
def self.for(file_set, services: Hyrax.config.derivative_services)
  services.map { |service| service.new(file_set) }.find(&:valid?) ||
    new(file_set)
end
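
To make the "first valid service wins" lookup concrete, here is a minimal sketch that mimics the method above without loading Hyrax. Every name in it (FakeFileSet, PdfSplitService, DefaultService, first_valid_service) is hypothetical and made up for illustration; only the shape (an ordered list of services, new(file_set), valid?, and a fallback) comes from the method shown above.

```ruby
# Hypothetical stand-in for a file set; a real Hyrax FileSet carries far more.
FakeFileSet = Struct.new(:mime_type)

# Hypothetical service that only considers itself valid for PDFs.
class PdfSplitService
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    @file_set.mime_type == "application/pdf"
  end
end

# Hypothetical catch-all service that is always valid.
class DefaultService
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    true
  end
end

# Same shape as the lookup above: instantiate each registered service in
# order, return the first valid one, and otherwise use the fallback.
def first_valid_service(file_set, services:, fallback: DefaultService)
  services.map { |service| service.new(file_set) }.find(&:valid?) ||
    fallback.new(file_set)
end

registered = [PdfSplitService]

puts first_valid_service(FakeFileSet.new("application/pdf"), services: registered).class
# PdfSplitService
puts first_valid_service(FakeFileSet.new("image/png"), services: registered).class
# DefaultService
```

Because the list is ordered, a gem that wants its service to win has to sit ahead of the more general services in that list.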

Hyrax::FileSetDerivativesService

The Hyrax::FileSetDerivativesService class leverages Hydra::Derivatives and, by default, is the one and only service registered in Hyrax.config.derivative_services. It is the long-standing approach for creating derivatives.

For each original file, we create derivatives of that original file based on its mime type.
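
As a rough illustration of "derivatives based on mime type", the sketch below maps mime type patterns to the derivatives that would be created. The table and the derivatives_for helper are assumptions made for this example; they are not Hyrax's actual list or API.

```ruby
# Hypothetical lookup table: mime type pattern => derivative names to create.
# The entries are illustrative only, not what Hyrax actually generates.
DERIVATIVES_BY_MIME_TYPE = {
  %r{\Aimage/} => %i[thumbnail],
  %r{\Aaudio/} => %i[mp3 ogg],
  %r{\Aapplication/pdf\z} => %i[thumbnail extracted_text]
}.freeze

# Return the derivative names for the first matching pattern, or an empty
# list when no pattern matches the given mime type.
def derivatives_for(mime_type)
  _pattern, names = DERIVATIVES_BY_MIME_TYPE.find do |pattern, _names|
    pattern.match?(mime_type)
  end
  names || []
end

p derivatives_for("image/tiff")      # [:thumbnail]
p derivatives_for("application/pdf") # [:thumbnail, :extracted_text]
p derivatives_for("text/plain")      # []
```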

NewspaperWorks

Created by the Boston Public Library and University of Utah, the NewspaperWorks gem introduced quite a few concepts:

  • models for Title, Issue, Page, and Article
  • batch ingest via command line
  • OCR and ALTO creation
  • newspaper-specific metadata fields
  • full-text search
  • calendar-based issue browsing
  • advanced search
  • OCR keyword match highlighting
  • viewer with page navigation and deep zooming

It does some of this by creating a new derivative service and registering that in the aforementioned Hyrax.config.derivative_services.

Extending/Overriding NewspaperWorks

For the NNP we leveraged NewspaperWorks, with several modifications and omissions.

Fundamentally we wanted to:

  • Rip PDFs apart, one image per page
  • Run OCR on those images
  • Index the image text as part of the parent PDF

All in service of a more pleasant and responsive IIIF Viewer Experience for the PDFs.

IiifPrint

  • IiifPrint: The woefully incorrect name of a gem SoftServ has been working on.

It is a subset of features extracted from the NewspaperWorks gem: the features we see as common requests from our clients.

It is guided by the use cases of NNP and other Hyku installations (e.g. British Library, Adventist, University of Tennessee Knoxville).

  1. Splitting a PDF into constituent pages, with a parent/child relationship.
  2. Returning parent works when children match the search criteria.
  3. IIIF Manifest includes parent/child relationships.
  4. Auto-assignment of parent/child relationship when splitting a PDF into constituent Pages.
  5. Text extraction, via tesseract, of text within an image.

It does some of this by creating a new derivative service and registering that in the aforementioned Hyrax.config.derivative_services.

Derivative::Rodeo DerivativeRodeo

Finally, the actual thing I’m here to talk about!

🐮🤠 The DerivativeRodeo is a further decomposition of the IiifPrint. 🤠🐮

In the future, IiifPrint will:

  • depend on the DerivativeRodeo
  • be renamed to something rodeo adjacent
  • provide the parent/child relationship management
  • provide search/indexing behavior

Why Make More Gems?

First, we want to do the PDF splitting and text extracting in a distributed environment (e.g. AWS Lambdas).

And given that we’re generating some derivatives in AWS Lambda, we want to be able to generate other derivatives in that space.

We also want our Lambda functions to use the same code as our monolith (but we definitely don’t want the monolith loaded in a Lambda).

There’s More Than One Way to Rodeo

[Diagram: the preprocess and import steps]

The Derivative Two-Step

In the previous diagram, preprocess and import represent the AWS Lambdas and the Hyrax monolith, respectively. The primary idea is that each environment knows how, via the DerivativeRodeo, to find or create the requisite derivatives for the original file’s mime type.

Foundations of the Rodeo

At its core, the DerivativeRodeo orchestrates the following:

  • Checking if something already exists “here”…
  • Or fetching something when it exists “elsewhere” and putting it “here”…
  • Or generating it “here”

What is “here”? It depends on the place where things are running.
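
The three steps above can be sketched with plain directories standing in for “here” and “elsewhere”. This is only an illustration of the idea, not the DerivativeRodeo’s API: the find_or_create helper, its keyword arguments, and the returned status symbols are all hypothetical.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: look "here" first, then copy from "elsewhere", and
# only generate (by calling the block) when neither location has the file.
def find_or_create(name, here:, elsewhere:)
  here_path = File.join(here, name)
  return [:found, here_path] if File.exist?(here_path)

  elsewhere_path = File.join(elsewhere, name)
  if File.exist?(elsewhere_path)
    FileUtils.cp(elsewhere_path, here_path)
    return [:fetched, here_path]
  end

  File.write(here_path, yield)
  [:generated, here_path]
end

Dir.mktmpdir do |here|
  Dir.mktmpdir do |elsewhere|
    # Pretend an earlier process already produced text for page 1.
    File.write(File.join(elsewhere, "page-1.txt"), "text made earlier")

    status, _path = find_or_create("page-1.txt", here: here, elsewhere: elsewhere) { "new text" }
    puts status # fetched

    status, _path = find_or_create("page-1.txt", here: here, elsewhere: elsewhere) { "new text" }
    puts status # found

    status, _path = find_or_create("page-2.txt", here: here, elsewhere: elsewhere) { "new text" }
    puts status # generated
  end
end
```

Swapping the two directories for, say, a bucket and a local disk changes what “here” means without changing the order of the checks.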

Wrap Up

Today’s Hyrax implementation does not handle the case where we already have some (or all) of the desired derivatives. And if you’re looking to rip apart PDFs, the processing within Hyrax is slow.

Where Can I find the DerivativeRodeo?

If you’re not minting new gems, what are you doing?

SoftServ is iterating on these concepts. The GitHub repositories I’m referencing are:

Why the derivative-rodeo and derivative_rodeo?

Rob and I have been playing a game of tennis, in which we each write code to demonstrate our understanding of the problem.

We then respond to the code with:

  • questions
  • diagrams
  • conversations
  • refactoring
  • proposals for alternate approaches

We do much of this asynchronously so we can work within Rob’s particularly challenging calendar constraints.

In our synchronous conversations, we include another developer to ensure that we’re delivering the most accessible code.

The “dash” rodeo was one exploration through code extraction, naming, and working through the process flow. The “underscore” rodeo is a further distillation and simplification of the “dash” rodeo.

What’s In A Name

In our conversations we reviewed RubyGems’s “Name your gem” guide; the underscore is more idiomatic for Ruby.2

Hence we’re settling on derivative_rodeo.

I encourage you all to ask Rob how those naming conventions were established. He was there in the days of yore.

What’s Next

This is all in-progress work; some of it is running in production across different Hyku installations. Our plan is to get both derivative_rodeo and iiif_print into a suitable state for Samvera Labs and transfer them once they’ve stabilized.

🐮🤠 Happy Trails 🤠🐮

These notes will become a blog post; I just need to wrangle up some time to do that.

Footnotes

1 An idiomatic American slang phrase meaning “I’m prepared for what comes next.”

2 Personally, I like dashes, which are also a more universal word boundary for search engines and assistive technology.