markpbaggett/report_2-12-24.md

## report_2-12-24.md

      
Weekly Report - February 12-16, 2024

Tasks / Worked On


RFTA Artists bug is fixed.

Fixed OAI logic bug.
Notified Ex Libris that I had worked out the problem.
Need to add tests so that this doesn't happen again.


Tested Workflows so that we can finally resume migration.

Long story short: if the failure is an Ldp:HttpError, rebuild; otherwise, follow the standard process.
Need to document the errors and when to use each approach.
Need to make sure everything is on production.
Next steps are to resume Gamble Phase 2 and step through the migrations hindered specifically by this problem.
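
The rule above can be sketched as a small triage helper. This is only an illustration, not the importer's actual logic: the function name `choose_recovery_step`, the returned labels, and the idea of matching on the error text are all assumptions made for the example.

```python
import re

# Assumption: the error can be recognized by name in a message or log line.
# The pattern accepts the spelling used in this report ("Ldp:HttpError")
# as well as the double-colon form ("Ldp::HttpError").
LDP_HTTP_ERROR = re.compile(r"Ldp::?HttpError")


def choose_recovery_step(error_message):
    """Return "rebuild" for Ldp HTTP errors and "standard" for anything else."""
    if LDP_HTTP_ERROR.search(error_message or ""):
        return "rebuild"
    return "standard"


if __name__ == "__main__":
    # Invented sample strings, not real importer output.
    for message in ("Ldp:HttpError", "some other failure"):
        print(f"{message!r} -> {choose_recovery_step(message)}")
```

Keeping the rule in one function would make it easy to extend once the various import errors are documented.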


Ran meeting with Emily for AV Annotations Interest Group.

Feedback is clear that we are pushing back against editors and advocating for accessibility and semantics.
Notes are here.


Met with Grad School and Bepress representatives to discuss problems.

Major takeaway: it's possible in Digital Commons to refactor workflows so that everything currently done on paper happens in the system.
Decisions now rest with grad school.
Next step: Install Vireo.


Worked with Josh to investigate workflows for generating OCR pre-ingest.

Experimented with OCRS.
Refined use of Tesseract.
See below for thoughts.


Provided expectations for Issue-105.
Reviewed Jennifer Mezick's Tenure and Promotion Documentation and provided feedback.

import os
import subprocess
from tqdm import tqdm

class OCRGenerator:
    def __init__(self, path):
        self.path = path
        self.files = self.__crawl()

    def __crawl(self):
        for root, dirs, files in os.walk(self.path):
            for file in files:
                if file.endswith(".tif"):
                    yield os.path.join(root, file)

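    @staticmethod
    def process_hocr(file):
        # Strip only the final extension; tesseract appends ".hocr" to this base.
        output_base = os.path.splitext(file)[0]
        # Pass arguments as a list (no shell) so paths with spaces are handled safely.
        command = ["tesseract", file, output_base, "-l", "eng", "hocr"]
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError as e:
            # Raised when the tesseract binary cannot be found or executed.
            print("An error occurred:", e)
            return None
        if result.returncode != 0:
            # Report tesseract's stderr and keep going with the next file.
            print("An error occurred:", result.stderr)
            return None
        return result.stdout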

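    @staticmethod
    def process_ocr(file):
        # Strip only the final extension; tesseract appends ".txt" to this base.
        output_base = os.path.splitext(file)[0]
        # Pass arguments as a list (no shell) so paths with spaces are handled safely.
        command = ["tesseract", file, output_base, "-l", "eng"]
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except OSError as e:
            # Raised when the tesseract binary cannot be found or executed.
            print("An error occurred:", e)
            return None
        if result.returncode != 0:
            # Report tesseract's stderr and keep going with the next file.
            print("An error occurred:", result.stderr)
            return None
        return result.stdout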


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-p', '--path_to_files', help='Specify Path to Run OCR Against', default='/home/mbagget1')
    args = parser.parse_args()
    ocr = OCRGenerator(args.path_to_files)
    for file in tqdm(ocr.files):
        ocr.process_hocr(file)
        ocr.process_ocr(file)
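
After a run, it may be worth confirming that both outputs exist for every image. The sketch below assumes the output base used above (the TIFF path without its extension) and a recent Tesseract, so each `page.tif` should end up next to `page.hocr` and `page.txt`. The function name `find_missing_outputs` is hypothetical.

```python
import os


def find_missing_outputs(path, extensions=(".hocr", ".txt")):
    """Yield (tif_path, missing_extensions) for each TIFF lacking an OCR output."""
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            if not name.endswith(".tif"):
                continue
            tif_path = os.path.join(root, name)
            # Same output base as the OCR script: the path minus ".tif".
            base = os.path.splitext(tif_path)[0]
            missing = [ext for ext in extensions if not os.path.exists(base + ext)]
            if missing:
                yield tif_path, missing
```

Calling `list(find_missing_outputs("/path/to/tiffs"))` returns an empty list when every TIFF has both files; anything else is a candidate for a rerun.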
Next Week


Take care of travel arrangements for Code4Lib and IIIF.
Meeting with Kirk this morning.
Work on import of new collections for Josh.
Test how we need to approach imports on QA. When do we stay in the same importer, and when do we break out? 😕
Get migration back on track.
Script something that will crawl failed filesets and check that they are indeed working after "fix".
Document various import errors, what they mean, and how to address each.
Think about and respond to scientist-softserv/utk-hyku#18
Continue drafting procedures, policies, and practices for DOI minting at UTK beyond the Libraries.
Follow through on commitment to investigate and report on Regression A/V Problem in UniversalViewer