@apjanco
Last active February 20, 2024 16:38
workflow.md

eScriptorium Iterative Workflow

Iterative workflow:

  • Sample: Curate a batch of document pages (say 100). The corpus should reflect common kinds of documents in the collection. Split into train and test sets.
  • Predict: Auto-transcribe with the current best model using Trainer (starting with Vision or Araucania?)
  • Upload: Upload the image files and transcriptions with Fetcher
  • Correct the errors in eScriptorium
  • Fine-tune the current best model on the new data
  • Assess improvement using test data. Generate character error rate (CER) and word error rate (WER) metrics.
  • Evaluate model transcriptions for research tasks. Record issues and areas that require improvement.
  graph TD;
      Sample-->Predict;
      Predict-->Correct;
      Correct-->Fine-tune;
      Fine-tune-->Assess;
      Assess-->Evaluate;
      Evaluate-->Sample;
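The Assess step depends on comparing model output against corrected text. The following is a minimal sketch of how CER and WER could be computed with plain Levenshtein distance, assuming both the reference and the prediction are available as plain strings. The function names `edit_distance`, `cer`, and `wer` are illustrative and are not part of eScriptorium or the helper tools.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference item missing from hypothesis
                curr[j - 1] + 1,     # extra item in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Calling `cer(gold_text, predicted_text)` and `wer(gold_text, predicted_text)` on each test page and averaging across the batch would give one number per iteration to compare against the previous model. Whitespace and punctuation normalization is left out here; whether to normalize before scoring is a project decision.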
  • Workflow is facilitated by helper tools

  • Experiment: how much work can a student reasonably be expected to do in an hour? 4-5 pages? How many images should be in each batch?

  • Set aside a portion of the corrected text for the "gold standard" evaluation set that is used to assess the model's performance. This material should never appear in the model's training data. The evaluation set should contain a variety of document types and be representative of the material the model will be used on.

  • We'll create a new project in eScriptorium for this work. Each batch will be a document.

  • Establish benchmarks. At what point is the model good enough for the project's downstream tasks: search, NER, entity linking, topic modeling, summarization with generative AI, term frequency?

  • Need an evaluation metric: character error rate (CER) and word error rate (WER). Given an input image, how many characters or words are wrong in the output text?

  • Need a system for managing project work.

    • Identify images to add to batch
    • Signal that prediction is complete
    • A way to record that a transcription has been corrected
    • A way to know that all transcriptions in the batch have been corrected
    • A way to share the assessment results and the plan for the next batch
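One possible shape for the project-management requirements above is a small per-batch status record. This is only a sketch: `PageStatus`, `Batch`, and all of their methods are hypothetical names, and nothing here talks to eScriptorium. It assumes each page moves through three states (selected, predicted, corrected).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class PageStatus(Enum):
    SELECTED = "selected"    # image has been added to the batch
    PREDICTED = "predicted"  # auto-transcription is available
    CORRECTED = "corrected"  # a person has corrected the transcription


@dataclass
class Batch:
    name: str
    pages: Dict[str, PageStatus] = field(default_factory=dict)
    assessment_notes: str = ""  # results and plan for the next batch

    def add_image(self, image_id: str) -> None:
        """Register an image as part of this batch."""
        self.pages[image_id] = PageStatus.SELECTED

    def mark(self, image_id: str, status: PageStatus) -> None:
        """Record a status change for an image already in the batch."""
        if image_id not in self.pages:
            raise KeyError(f"{image_id} is not in batch {self.name}")
        self.pages[image_id] = status

    def prediction_complete(self) -> bool:
        """True once every page has at least been auto-transcribed."""
        return bool(self.pages) and all(
            s is not PageStatus.SELECTED for s in self.pages.values()
        )

    def all_corrected(self) -> bool:
        """True once every page in the batch has been corrected."""
        return bool(self.pages) and all(
            s is PageStatus.CORRECTED for s in self.pages.values()
        )
```

The same information could just as well live in a shared spreadsheet or issue tracker; the point is that each of the five needs maps to one field or one check.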