@seanupton
Created July 2, 2019 15:52
PDF Ingester
path = '/some/path/to/publication/sn99999999'
ingester = NewspaperWorks::Ingest::PDFIngester.new(path)
# if LCCN cannot be determined and validated from path, then raise an error
# during construction above...
# - Presumption: no character padding in LCCNs in either path or
#   in any command form.
# - Normalize LCCN:
#   - Strip whitespace?
#   - Make lower case.
# - Validate LCCN with regex:
#   - may begin with an alphabetic record prefix of up to 3 letters
#   - must contain 8 or 10 subsequent digits.
# def normalized_lccn(v)
#   # \A and \z anchor the whole string; ^ and $ only anchor lines in Ruby
#   p = /\A[a-z]{0,3}[0-9]{8}([0-9]{2})?\z/
#   v = v.gsub(/\s+/, '').downcase.slice(0, 13)
#   raise ArgumentError, "LCCN appears invalid: #{v}" unless p.match(v)
#   v # return the normalized value
# end
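# A minimal runnable sketch of the normalization and path handling described
# above. The method names normalize_lccn and lccn_from_path are hypothetical,
# and it assumes the LCCN is the final component of the path; neither is
# confirmed as part of the NewspaperWorks API. Unlike the pseudocode above,
# this sketch does not truncate the value before validating it.

```ruby
# Hypothetical helper: strip whitespace, lower-case, then validate.
def normalize_lccn(value)
  pattern = /\A[a-z]{0,3}[0-9]{8}([0-9]{2})?\z/
  v = value.to_s.gsub(/\s+/, '').downcase
  raise ArgumentError, "LCCN appears invalid: #{v}" unless pattern.match?(v)
  v
end

# Assumption: the last path component is the LCCN of the publication.
def lccn_from_path(path)
  normalize_lccn(File.basename(path))
end

puts lccn_from_path('/some/path/to/publication/sn99999999')
```

# A path whose last component fails validation raises ArgumentError, which
# matches the "raise an error during construction" behavior described above.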
ingester.lccn == 'sn99999999' # => true
ingester.issues.class == NewspaperWorks::Ingest::PDFIssues # => true
ingester.ingest
# => Ensures title exists with proper metadata for LCCN, saved.
# => Synchronously loops through each issue:
#    1. creates a work for each,
#    2. saves it w/ issue metadata (publication_date, edition_number, lccn, title).
#    3. Ensures administrative metadata is set to match opts
#    4. Ensures that issue work is in publication.ordered_members
#    5. Attaches PDF via:
#         attachment = WorkFiles.of(work)
#         attachment.assign(pdf_path)
#         attachment.commit!
#    6. queues CreateIssuePagesJob.perform_later(work, [pdf_path], nil, nil)
#       - user and admin_set args are nil, because they are already
#         set in step 3 above.
#    - Most of this work is logged into sidekiq or similar job runner log.
# alternate construction, passing the LCCN explicitly as the second argument:
ingester = NewspaperWorks::Ingest::PDFIngester.new(path, 'sn88888888')
# construction shall not attempt to parse the LCCN from the directory structure
ingester.ingest
# performs as above.
# The ingester will have an alternate constructor, self.from_command,
# used by the rake task; it will support rake task arguments via OptionParser:
# --path: path to directory containing issues
# --lccn: LCCN of pub
# --admin_set: name of admin set
# --depositor: name of user depositing
# --visibility: visibility of
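# A minimal sketch of how the arguments listed above could be parsed with
# Ruby's standard OptionParser. The method name parse_ingest_options and the
# shape of the returned hash are assumptions for illustration; the actual
# from_command constructor may differ.

```ruby
require 'optparse'

# Hypothetical helper: turn command arguments into an options hash.
def parse_ingest_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('--path PATH', 'path to directory containing issues') { |v| options[:path] = v }
    opts.on('--lccn LCCN', 'LCCN of publication') { |v| options[:lccn] = v }
    opts.on('--admin_set NAME', 'name of admin set') { |v| options[:admin_set] = v }
    opts.on('--depositor NAME', 'name of user depositing') { |v| options[:depositor] = v }
    opts.on('--visibility VALUE', 'visibility') { |v| options[:visibility] = v }
  end
  # parse (not parse!) leaves the caller's argv array unmodified
  parser.parse(argv)
  raise ArgumentError, 'No --path provided' if options[:path].nil?
  options
end

opts = parse_ingest_options(['--path', '/some/path/to/publication/sn99999999'])
puts opts[:path]
```

# Options that are not given are simply absent from the hash, so a
# constructor built on this could fall back to parsing the LCCN from the path
# when :lccn is missing.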
# Out of scope:
# Each instance of PDFIngester is tied to ONE PUBLICATION.
# However, there might be a class method that could handle a rake task
# pointed at a parent directory, containing multiple sub-dirs per LCCN.
# This is TBD later. ^^^
# The interface for traversing issues, the PDFIssues class:
path = '/some/path/to/publication/sn99999999'
ingester = NewspaperWorks::Ingest::PDFIngester.new(path)
ingester.issues
# Get size
ingester.issues.size
# PDFIssues is an enumerable of key (path) / value (PDFIssue object) pairs,
# and presents a hash-like interface.
# Get paths:
ingester.issues.keys
# Enumerate over each issue:
ingester.issues.each do |path, issue|
  issue.lccn == ingester.lccn # => true
  issue.publication_date # => an ISO 8601 date stamp (from file name)
  issue.edition_number # => found edition number from file name or 1
  issue.title # title containing publication title, date, optionally edition #
end
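# A minimal sketch of the hash-like, enumerable interface described above.
# SketchIssue and SketchIssues are hypothetical stand-ins, not the
# NewspaperWorks classes. The file naming convention is not specified here, so
# this assumes names of the form YYYYMMDDEE.pdf, where the two-digit edition
# (EE) is optional and defaults to 1.

```ruby
require 'date'

# Hypothetical stand-in for PDFIssue: metadata parsed from the file name.
class SketchIssue
  attr_reader :path, :lccn, :edition_number

  def initialize(path, lccn)
    @path = path
    @lccn = lccn
    name = File.basename(path, '.pdf')
    match = /\A(\d{4})(\d{2})(\d{2})(\d{2})?\z/.match(name)
    raise ArgumentError, "Unexpected file name: #{name}" if match.nil?
    @date = Date.new(match[1].to_i, match[2].to_i, match[3].to_i)
    # Assumption: a missing edition suffix means edition 1
    @edition_number = match[4] ? match[4].to_i : 1
  end

  # ISO 8601 date stamp, e.g. "2019-07-02"
  def publication_date
    @date.iso8601
  end
end

# Hypothetical stand-in for PDFIssues: enumerable of path => issue pairs.
class SketchIssues
  include Enumerable

  def initialize(lccn, pdf_paths)
    @issues = pdf_paths.sort.map { |p| [p, SketchIssue.new(p, lccn)] }.to_h
  end

  # Yields [path, issue] pairs, like Hash#each
  def each(&block)
    @issues.each(&block)
  end

  def keys
    @issues.keys
  end

  def size
    @issues.size
  end

  def [](path)
    @issues[path]
  end
end

issues = SketchIssues.new('sn99999999', ['/some/path/2019070202.pdf'])
issues.each { |path, issue| puts "#{path}: #{issue.publication_date}" }
```

# Backing the collection with a Hash keeps keys, size and each consistent
# with one another, and including Enumerable provides map, select and similar
# methods on top of each.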