Created
July 2, 2019 15:52
PDF Ingester
path = '/some/path/to/publication/sn99999999'
ingester = NewspaperWorks::Ingest::PDFIngester.new(path)
# if LCCN cannot be determined and validated from path, then raise an error
#   during construction above...
#   - Presumption: no character padding in LCCNs in either path or
#     in any command form.
#   - Normalize LCCN:
#     - Strip whitespace?
#     - Make lower case.
#   - Validate LCCN with regex:
#     - must begin with 1-3 character alpha record prefix
#     - must contain 8 or 10 subsequent digits.
#   def normalized_lccn(v)
#     p = /^[A-Za-z]{0,3}[0-9]{8}([0-9]{2})?$/
#     v = v.gsub(/\s+/, '').downcase.slice(0,13)
#     raise ArgumentError, "LCCN appears invalid: #{v}" unless p.match(v)
#     v  # return the normalized value
#   end
ingester.lccn == 'sn99999999' # => true
ingester.issues.class == NewspaperWorks::Ingest::PDFIssues # => true
ingester.ingest
# => Ensures title exists with proper metadata for LCCN, saved.
# => Synchronously loops through each issue:
#   1. creates a work for each,
#   2. saved w/ issue metadata (publication_date, edition_number, lccn, title).
#   3. Ensures administrative metadata is set to match opts.
#   4. Ensures that issue work is in publication.ordered_members.
#   5. Attaches PDF via:
#        attachment = WorkFiles.of(work)
#        attachment.assign(pdf_path)
#        attachment.commit!
#   6. queues CreateIssuePagesJob.perform_later(work, [pdf_path], nil, nil)
#      - user and admin_set args are nil, because they are already
#        set in step 3 above.
#      - Most of this work is logged to the sidekiq (or similar job runner) log.
# alternate construction, passing the LCCN explicitly as the second argument:
ingester = NewspaperWorks::Ingest::PDFIngester.new(path, 'sn88888888')
# construction shall not attempt to parse the LCCN from the directory structure
ingester.ingest
# performs as above.
# The ingester will have an alternate constructor, self.from_command,
# used by the rake task; it will support rake task arguments via OptionParser:
#   --path: path to directory containing issues
#   --lccn: LCCN of pub
#   --admin_set: name of admin set
#   --depositor: name of user depositing
#   --visibility: visibility of
# Out of scope:
#   Each instance of PDFIngester is tied to ONE PUBLICATION.
#   However, there might be a class method that could handle a rake task
#   pointed at a parent directory, containing multiple sub-dirs per LCCN.
#   This is TBD later. ^^^
# The interface for traversing issues, the PDFIssues class:
path = '/some/path/to/publication/sn99999999'
ingester = NewspaperWorks::Ingest::PDFIngester.new(path)
ingester.issues
# Get size
ingester.issues.size
# PDFIssues is an enumerable of key (path) / value (PDFIssue object) pairs,
# and presents a hash-like interface.
# Get paths:
ingester.issues.keys
# Enumerate over each issue:
ingester.issues.each do |path, issue|
  issue.lccn == ingester.lccn # => true
  issue.publication_date # => an ISO 8601 date stamp (from file name)
  issue.edition_number # => edition number found in file name, or 1
  issue.title # => title containing publication title, date, optionally edition number
end
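
The LCCN normalization described in the comments above could be sketched as a standalone helper. This is only an illustration, not the NewspaperWorks implementation: the constant `LCCN_PATTERN` and the method `lccn_from_path` are hypothetical names, and the sketch assumes the LCCN is the last segment of the publication path. It follows the regex from the comments, so the alphabetic prefix is treated as optional.

```ruby
# Hypothetical helper, loosely following the commented-out pseudocode.
# Pattern: up to 3 letters, then 8 digits, optionally 2 more (10 total).
LCCN_PATTERN = /\A[a-z]{0,3}[0-9]{8}([0-9]{2})?\z/

# Strip all whitespace, lower-case, validate, and return the result.
def normalized_lccn(value)
  v = value.to_s.gsub(/\s+/, '').downcase
  raise ArgumentError, "LCCN appears invalid: #{v}" unless LCCN_PATTERN.match?(v)
  v
end

# Assumption: the final path segment is the LCCN of the publication.
def lccn_from_path(path)
  normalized_lccn(File.basename(path))
end
```

With these definitions, `lccn_from_path('/some/path/to/publication/sn99999999')` yields `'sn99999999'`, and a value that does not match the pattern raises `ArgumentError`.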
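
The argument handling planned for `self.from_command` might look roughly like the following, using Ruby's standard `OptionParser`. The method name `parse_ingest_options`, the returned Hash shape, and the help strings are assumptions made for illustration; only the flag names come from the list above.

```ruby
require 'optparse'

# Hypothetical helper: collects the documented flags into a Hash.
# Each flag takes one required value; none are treated as mandatory here.
def parse_ingest_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on('--path PATH', 'Path to directory containing issues') { |v| options[:path] = v }
    opts.on('--lccn LCCN', 'LCCN of publication') { |v| options[:lccn] = v }
    opts.on('--admin_set NAME', 'Name of admin set') { |v| options[:admin_set] = v }
    opts.on('--depositor USER', 'Name of user depositing') { |v| options[:depositor] = v }
    opts.on('--visibility VALUE', 'Visibility setting') { |v| options[:visibility] = v }
  end
  parser.parse(argv) # non-destructive; leaves argv untouched
  options
end
```

A constructor such as `from_command` could then pass `options[:path]` and `options[:lccn]` on to `PDFIngester.new`, though how the remaining options are applied is not specified above.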
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment