
@markpbaggett
Last active February 19, 2024 14:08

Weekly Report - February 12 - 16 2024

Tasks / Worked On

  • RFTA Artists bug is fixed.
    • Fixed OAI logic bug.
    • Notified Ex Libris that I had worked out the problem.
    • Need to add tests so that this doesn't happen again.
  • Tested Workflows so that we can finally resume migration.
    • Long story short: if the failure is an Ldp:HttpError, rebuild. If not, follow the standard process.
    • Need to document errors, when to use what, etc.
    • Need to make sure everything is on production.
    • Next steps are to resume Gamble Phase 2 and step through the migrations hindered specifically by this problem.
  • Ran meeting with Emily for AV Annotations Interest Group.
    • Feedback is clear that we are pushing back against editors and advocating for accessibility and semantics.
    • Notes are here.
  • Met with Grad School and Bepress representatives to discuss problems.
    • Major takeaway: it's possible in Digital Commons to refactor workflows so that everything now done on paper happens in the system.
    • Decisions now rest with grad school.
    • Next step: Install Vireo.
  • Worked with Josh to investigate workflows for generating OCR before ingest.
    • Experimented with OCRS.
    • Refined use of Tesseract.
    • See below for thoughts.
  • Provided expectations for Issue-105.
  • Reviewed Jennifer Mezick's Tenure and Promotion Documentation and provided feedback.
import os
import subprocess
from tqdm import tqdm

class OCRGenerator:
    def __init__(self, path):
        self.path = path
        self.files = self.__crawl()

    def __crawl(self):
        for root, dirs, files in os.walk(self.path):
            for file in files:
                if file.endswith(".tif"):
                    yield os.path.join(root, file)

    @staticmethod
    def _run_tesseract(file, *configs):
        # Tesseract appends the output extension (.txt or .hocr) to this base itself.
        output_base = os.path.splitext(file)[0]
        # Pass arguments as a list (no shell) so paths with spaces are handled safely.
        command = ["tesseract", file, output_base, "-l", "eng", *configs]
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError as e:
            # Raised when the tesseract binary cannot be found or executed.
            print(f"Could not run tesseract on {file}:", e)
            return None
        if result.returncode != 0:
            # Report the failure with stderr and move on to the next file.
            print(f"An error occurred processing {file}:", result.stderr)
            return None
        return result.stdout

    @staticmethod
    def process_hocr(file):
        # Writes <name>.hocr next to the source TIFF.
        return OCRGenerator._run_tesseract(file, "hocr")

    @staticmethod
    def process_ocr(file):
        # Writes <name>.txt next to the source TIFF.
        return OCRGenerator._run_tesseract(file)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--path_to_files', help='Specify Path to Run OCR Against', default='/home/mbagget1')
    args = parser.parse_args()
    ocr = OCRGenerator(args.path_to_files)
    for file in tqdm(ocr.files):
        ocr.process_hocr(file)
        ocr.process_ocr(file)
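
Because the script above prints failures and keeps going, a follow-up check for TIFFs that ended up without output may be useful. The sketch below is an assumption rather than part of the original workflow: `find_missing_ocr` is a hypothetical helper, and it assumes Tesseract wrote `.txt` and `.hocr` files next to each `.tif`, as the script does.

```python
import os


def find_missing_ocr(path, extensions=(".txt", ".hocr")):
    """Return the expected OCR output files that are missing or empty.

    Hypothetical helper: assumes outputs sit beside each .tif and share
    its base name, which is how the script above invokes Tesseract.
    """
    missing = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if not name.endswith(".tif"):
                continue
            base = os.path.splitext(os.path.join(root, name))[0]
            for ext in extensions:
                candidate = base + ext
                # A zero-byte file counts as a failure too.
                if not os.path.isfile(candidate) or os.path.getsize(candidate) == 0:
                    missing.append(candidate)
    return sorted(missing)
```

Calling `find_missing_ocr("/path/to/tiffs")` (a placeholder path) after a run returns a sorted list of the outputs that still need to be regenerated; an empty list means every TIFF has both files.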

Next Week

  • Take care of travel arrangements for Code4Lib and IIIF.
  • Meeting with Kirk this morning.
  • Work on import of new collections for Josh.
  • Test how we need to approach imports on qa. When do we stay in the same importer, and when do we break out? 😕
  • Get migration back on track.
  • Script something that will crawl failed filesets and check that they are indeed working after "fix".
  • Document various import errors, what they mean, and how to address each.
  • Think about and respond to scientist-softserv/utk-hyku#18
  • Continue drafting procedures, policies, and practices for DOI minting at UTK beyond the Libraries.
  • Follow through on commitment to investigate and report on Regression A/V Problem in UniversalViewer