apjanco/gist:1a4595716b1119d1f247fd4e7ba5e10b

## gistfile1.md

      
## eScriptorium Iterative Workflow

Iterative workflow:

1. Sample: Curate a batch of document pages (say, 100). The sample should reflect the common kinds of documents in the collection. Split it into train and test sets.
2. Predict: Auto-transcribe the pages with the current best model using Trainer (starting with Vision or Araucania?).
3. Upload: Upload the image files and transcriptions with Fetcher.
4. Correct: Correct the errors in eScriptorium.
5. Fine-tune: Fine-tune the current best model on the new data.
6. Assess: Measure improvement on the test data. Generate character error rate (CER) and word error rate (WER) metrics.
7. Evaluate: Evaluate the model's transcriptions for research tasks. Record issues and areas that require improvement.


```mermaid
graph TD;
    Sample-->Predict;
    Predict-->Correct;
    Correct-->Fine-tune;
    Fine-tune-->Assess;
    Assess-->Evaluate;
    Evaluate-->Sample;
```

    
The workflow is facilitated by helper tools (Trainer for prediction, Fetcher for upload).


Experiment: how much work can a student reasonably be expected to do in an hour? 4-5 pages? How many images should be in each batch?
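
To get a feel for the numbers while the experiment is pending, here is a small sketch that converts a batch size and a correction rate into student hours. The function name `hours_for_batch` is made up for illustration, and the 4-5 pages per hour figure is only the guess from the question above, not a measured rate.

```python
def hours_for_batch(pages_in_batch, pages_per_hour):
    """Estimate the correction time for one batch, in hours."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages_in_batch / pages_per_hour


# Guessed rates from the note above; replace with measured values
# once the timing experiment has been run.
for rate in (4, 5):
    print(f"100 pages at {rate} pages/hour: {hours_for_batch(100, rate)} hours")
```

Under those guessed rates, a 100-page batch would take between 20 and 25 student hours, which may argue for smaller batches if fast iteration matters.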


Set aside a portion of the corrected text for the "gold standard" evaluation set that is used to assess the model's performance. This material should never appear in the model's training data. The evaluation set should contain a variety of document types and be representative of the material the model will be used on.
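
One possible way to set aside the gold standard is a split stratified by document type, so every type is represented in the evaluation set. This is a sketch only: the `(page_id, doc_type)` input shape, the function name `split_gold_standard`, and the 20% default are assumptions, not part of any existing tool.

```python
import random
from collections import defaultdict


def split_gold_standard(pages, gold_fraction=0.2, seed=42):
    """Split pages into (train, gold) lists of page IDs.

    pages: iterable of (page_id, doc_type) tuples (assumed shape).
    Each document type contributes at least one page to the gold set,
    so a type with a single page ends up in gold only.
    """
    by_type = defaultdict(list)
    for page_id, doc_type in pages:
        by_type[doc_type].append(page_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, gold = [], []
    for doc_type in sorted(by_type):
        ids = sorted(by_type[doc_type])
        rng.shuffle(ids)
        n_gold = max(1, round(len(ids) * gold_fraction))
        gold.extend(ids[:n_gold])
        train.extend(ids[n_gold:])
    return train, gold


# Example with made-up page IDs and document types.
pages = [(f"letter_{i}", "letter") for i in range(80)]
pages += [(f"ledger_{i}", "ledger") for i in range(20)]
train, gold = split_gold_standard(pages)

# The gold set must never leak into the training data.
assert not set(train) & set(gold)
```

Saving the gold page IDs to a file and reusing them across iterations keeps the metrics comparable from one fine-tuning round to the next.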


We'll create a new project in eScriptorium for this work. Each batch will be a document.


Establish benchmarks. At what point is the model sufficient for the project's necessary tasks? Candidate tasks: search, named entity recognition (NER), entity linking, topic modeling, summarization with generative AI, and term frequency.


Need an evaluation metric: character error rate (CER) and word error rate (WER).
Given an input image, how many characters (or words) in the output text are wrong compared with the corrected transcription?
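
Both metrics are commonly defined as the edit distance (substitutions, deletions, and insertions) between the model output and the corrected transcription, divided by the length of the corrected transcription. The following is a minimal pure-Python sketch of that definition; the function names are illustrative, and a real assessment may prefer an established evaluation tool and may want to normalize whitespace or Unicode first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("kitten", "sitting"))                             # 0.5
print(wer("the quick brown fox", "the quick brown dog"))    # 0.25
```

Because the divisor is the reference length, both rates can exceed 1.0 when the model output is much longer than the corrected text.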


Need a system for managing project work:

- A way to identify images to add to a batch
- A way to signal that prediction is complete
- A way to record that a transcription has been corrected
- A way to know that all transcriptions in the batch have been corrected
- A way to share the assessment results and the plan for the next batch
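
As a starting point for discussion, the requirements above could be modeled as a small per-batch status record. Everything here is hypothetical (the `Batch` class, the `PageStatus` values, the field names); it is not an eScriptorium feature, and a shared spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass, field
from enum import Enum


class PageStatus(Enum):
    SELECTED = "selected"    # image identified for the batch
    PREDICTED = "predicted"  # auto-transcription exists
    CORRECTED = "corrected"  # a person has corrected the transcription


@dataclass
class Batch:
    name: str
    pages: dict = field(default_factory=dict)  # page_id -> PageStatus
    assessment_notes: str = ""                 # results and plan for the next batch

    def add_page(self, page_id):
        self.pages[page_id] = PageStatus.SELECTED

    def mark(self, page_id, status):
        if page_id not in self.pages:
            raise KeyError(f"{page_id} is not in batch {self.name}")
        self.pages[page_id] = status

    def prediction_complete(self):
        """True once every page has at least been predicted."""
        return bool(self.pages) and all(
            s is not PageStatus.SELECTED for s in self.pages.values()
        )

    def all_corrected(self):
        """True once every page in the batch has been corrected."""
        return bool(self.pages) and all(
            s is PageStatus.CORRECTED for s in self.pages.values()
        )


batch = Batch("batch-01")
for page_id in ("page_001", "page_002"):
    batch.add_page(page_id)
    batch.mark(page_id, PageStatus.PREDICTED)
batch.mark("page_001", PageStatus.CORRECTED)
print(batch.prediction_complete(), batch.all_corrected())  # True False
```

Keeping one record per batch lines up with the plan of one eScriptorium document per batch, so the status can be checked document by document.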