@kba
Created April 10, 2018 14:55
See also ocr-d.github.io/PhilTag-2018

1. Generating TIFF images

→ convert

2. Generating box files

for y in `ls /home/binder/OCR/ocropus/fraktur_19jh/SELECTION-TRAIN/training/`; do
  echo $y
  for l in `ls /home/binder/OCR/ocropus/fraktur_19jh/SELECTION-TRAIN/training/$y/*.bin.png`; do
    base=`basename $l .bin.png`; echo "$y"_$base
    # Convert the binarized line image to TIFF
    convert $l data/"$y"_$base.tif
    # Generate the box file from the line image and its ground-truth text
    python generate_line_box.py -i data/"$y"_$base.tif \
      -t /home/binder/OCR/ocropus/fraktur_19jh/SELECTION-TRAIN/training/"$y"/$base.gt.txt \
      > data/"$y"_$base.box
    # Copy the ground truth next to the image
    cp /home/binder/OCR/ocropus/fraktur_19jh/SELECTION-TRAIN/training/"$y"/$base.gt.txt \
      data/"$y"_$base.gt.txt
  done
done

3. Generating the codec

  /usr/local/bin/unicharset_extractor --output_unicharset springmann.unicharset --norm_mode 1 *.box

4. Generating lstmf files

for i in `ls *.tif`;do
  base=`basename $i .tif`
  echo $base
  tesseract $i $base lstm.train
done

5. Generating the proto model

/usr/local/bin/combine_lang_model \
  --input_unicharset Fraktur.unicharset \
  --script_dir /home/kmw/built/langdata \
  --output_dir tmp/ \
  --lang Fraktur

6. Splitting the data into test and training sets

  ls data/*.lstmf | sort -R > Fraktur.files.random.txt
  head -n 300 Fraktur.files.random.txt > Fraktur.test_files.txt
  tail -n +301 Fraktur.files.random.txt > Fraktur.training_files.txt
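
The fixed 300-line split above can be parameterized. The following is a minimal sketch, assuming a hypothetical helper named `split_list` (not part of the original workflow) and a `sort` that supports `-R`, as the commands above already do. It shuffles the list once, so the test and training lists cannot overlap.

```shell
#!/bin/sh
# split_list LIST N TEST_OUT TRAIN_OUT  (hypothetical helper, for illustration)
# Shuffles LIST once, then writes the first N lines to TEST_OUT and the
# remaining lines to TRAIN_OUT.
split_list() {
  list=$1; n=$2; test_out=$3; train_out=$4
  shuffled=$(mktemp) || return 1
  sort -R "$list" > "$shuffled"
  head -n "$n" "$shuffled" > "$test_out"
  # tail -n +K starts at line K, so K = N + 1 skips the test lines
  tail -n "+$((n + 1))" "$shuffled" > "$train_out"
  rm -f "$shuffled"
}
```

With a list produced by `ls data/*.lstmf > Fraktur.files.txt`, a call such as `split_list Fraktur.files.txt 300 Fraktur.test_files.txt Fraktur.training_files.txt` should give the same kind of split as the three commands above.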

7. Model training

NOTE: O1c61 means that the alphabet (codec) contains 61 characters.

lstmtraining \
  --traineddata tmp/Fraktur/Fraktur.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c61]' \
  --model_output out/base \
  --learning_rate 20e-4 \
  --train_listfile Fraktur.training_files.txt \
  --eval_listfile Fraktur.test_files.txt \
  --max_iterations 10000
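
The class count in `O1c61` is tied to the codec, so it could be read from the unicharset rather than typed by hand. This is a sketch under the assumption that the first line of the unicharset file holds the number of entries. The helper name `net_spec_for` is hypothetical, and the layer layout is copied from the command above.

```shell
#!/bin/sh
# net_spec_for UNICHARSET  (hypothetical helper, for illustration)
# Reads the entry count from the first line of UNICHARSET (assumed layout)
# and prints the network spec with a matching output layer size.
net_spec_for() {
  n=$(head -n 1 "$1" | tr -d '[:space:]')
  echo "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c${n}]"
}
```

It could then be passed to training as `--net_spec "$(net_spec_for springmann.unicharset)"`. The file name must be the unicharset that was actually used to build the proto model.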

8. Finalizing the model

lstmtraining \
  --stop_training \
  --continue_from out/base_checkpoint \
  --traineddata tmp/Fraktur/Fraktur.traineddata \
  --model_output out/Fraktur.traineddata

9. Recognizing the test data

for i in `ls test/*.tif`;do
  base=`basename $i .tif`
  echo $base
  tesseract --tessdata-dir out/ --psm 13 -l Fraktur test/$base.tif test/$base
done