@rameshkrishna
Last active July 16, 2021 13:23
tesseract_patterns_triaining_file
// Inserts the list of patterns from the given file into the Trie.
// The pattern list file should contain one pattern per line in UTF-8 format.
//
// Each pattern can contain any non-whitespace characters, however only the
// patterns that contain characters from the unicharset of the corresponding
// language will be useful.
// The only meta character is '\'. To use it in a pattern as an ordinary
// character, escape it with another '\' (e.g. the string "C:\Documents"
// should be written in the patterns file as "C:\\Documents").
// This function supports a very limited regular expression syntax. One can
// express a character, a certain character class and a number of times the
// entity should be repeated in the pattern.
//
// To denote a character class use one of:
// \c - unichar for which UNICHARSET::get_isalpha() is true (character)
// \d - unichar for which UNICHARSET::get_isdigit() is true
// \n - unichar for which UNICHARSET::get_isdigit() or
//      UNICHARSET::get_isalpha() is true (alphanumeric)
// \p - unichar for which UNICHARSET::get_ispunct() is true
// \a - unichar for which UNICHARSET::get_islower() is true
// \A - unichar for which UNICHARSET::get_isupper() is true
//
// \* could be specified after each character or pattern to indicate that
// the character/pattern can be repeated any number of times before the next
// character/pattern occurs.
//
// Examples:
// 1-8\d\d-GOOG-411 will be expanded to strings:
// 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.
//
// http://www.\n\*.com will be expanded to strings like:
// http://www.a.com http://www.a123.com ... http://www.ABCDefgHIJKLMNop.com
//
// Note: When choosing which patterns to include, be aware that
// providing very generic patterns will make Tesseract run slower.
// For example \n\* at the beginning of the pattern will make Tesseract
// consider all the combinations of proposed character choices for each
// of the segmentations, which will be unacceptably slow.
// Because of potential problems with speed that could be difficult to
// identify, each user pattern has to have at least kSaneNumConcreteChars
// concrete characters from the unicharset at the beginning.
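The syntax described in the comment above can be approximated with an ordinary regular expression, which is handy for checking a patterns file before handing it to Tesseract. The following is a minimal sketch, not Tesseract's own parser. It makes several assumptions: the character classes are replaced by ASCII stand-ins rather than the real UNICHARSET properties, `\*` is treated as "one or more" (which is what the `http://www.\n\*.com` example suggests), and the helper names `tokenize`, `to_regex`, and `leading_concrete_chars` are hypothetical.

```python
import re
import string

# ASCII stand-ins for the UNICHARSET-based classes (assumption: the real
# classes depend on the language's unicharset, not on ASCII ranges).
CLASS_TO_REGEX = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def tokenize(pattern):
    """Split a pattern into (kind, value, repeated) tuples.

    kind is "class" for a character class or "char" for a concrete character.
    """
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            tokens.append(("char", ch, False))
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        code = pattern[i + 1]
        i += 2
        if code == "*":
            # The repetition marker applies to the previous token.
            if not tokens:
                raise ValueError("repetition marker with nothing to repeat")
            kind, value, _ = tokens[-1]
            tokens[-1] = (kind, value, True)
        elif code == "\\":
            tokens.append(("char", "\\", False))
        elif code in CLASS_TO_REGEX:
            tokens.append(("class", code, False))
        else:
            raise ValueError(f"unknown escape \\{code}")
    return tokens


def to_regex(pattern):
    """Compile an approximate Python regex for one pattern line."""
    parts = []
    for kind, value, repeated in tokenize(pattern):
        piece = CLASS_TO_REGEX[value] if kind == "class" else re.escape(value)
        # Assumption: the repetition marker means "one or more".
        parts.append(piece + ("+" if repeated else ""))
    return re.compile("".join(parts))


def leading_concrete_chars(pattern):
    """Count the concrete characters before the first character class."""
    count = 0
    for kind, _, _ in tokenize(pattern):
        if kind != "char":
            break
        count += 1
    return count
```

For example, `to_regex(r"1-8\d\d-GOOG-411").fullmatch("1-800-GOOG-411")` returns a match object, and `leading_concrete_chars(r"\n\*abc")` returns 0, flagging the kind of pattern the note warns about. The count can be compared against whatever threshold the Tesseract build in use enforces; that value is not stated in the comment, so it is left to the caller.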
https://github.com/tesseract-ocr/tesseract/blob/442b5b7/dict/trie.h#L192
https://www.browserling.com/tools/text-from-regex
Sample pattern:
97T\d
Sample strings it expands to:
97T5
97T0
97T3
97T6
97T4
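The sample strings above can be reproduced by enumerating every expansion of a pattern that has no `\*` repetition. This is a rough sketch under the same caveat as the comment's character classes: the `CLASS_CHARS` table uses ASCII stand-ins for the UNICHARSET properties, and the function name `expand` is hypothetical rather than anything Tesseract provides.

```python
import itertools
import string

# ASCII stand-ins for the UNICHARSET-based classes (assumption).
CLASS_CHARS = {
    "c": string.ascii_letters,
    "d": string.digits,
    "n": string.ascii_letters + string.digits,
    "p": string.punctuation,
    "a": string.ascii_lowercase,
    "A": string.ascii_uppercase,
}


def expand(pattern):
    """Return every string covered by a pattern that has no repetition marker."""
    choices = []  # one string of candidate characters per pattern position
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash at end of pattern")
            code = pattern[i + 1]
            if code == "*":
                raise ValueError("repetition makes the expansion unbounded")
            # An escaped backslash is a literal; anything else is a class.
            choices.append("\\" if code == "\\" else CLASS_CHARS[code])
            i += 2
        else:
            choices.append(pattern[i])
            i += 1
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for text in expand(r"97T\d"):
        print(text)
```

With the ASCII digit table, `expand(r"97T\d")` yields the ten strings `97T0` through `97T9`, which includes every sample listed above.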