@caugner
Created July 22, 2013 14:37
Ruby regular expression for German texts
#!/usr/bin/env ruby
# encoding: utf-8
# uppercase characters
uchar = 'A-ZÄÖÜÁÀÉÈ'
# lowercase characters
lchar = 'a-zäöüßáàéè'
# all characters
char = uchar + lchar
# whitespace characters
space_char = ' '
# whitespace
space = '(?:[' + space_char + ']+)'
# hyphens
hyphen = '(?:' + space + '(?:--|-|–|—)' + space + ')'
# end of word
eow = '(?:' + space + '|,' + space + '|;' + space + '|\\.' + space + '(?![' + uchar + ']))'
# end of sentence
eos_char = '\\.\\?\\!…'
eos = '(?:[' + eos_char + ']+)'
# uppercase word (one or more uppercase characters)
uword = '[' + uchar + ']+'
# lowercase word (one or more lowercase characters)
lword = '[' + lchar + ']+'
# capital word
cword = '[' + uchar + '][' + char + ']*'
# any word
word = '[' + char + ']+'
# separator before a word: end of the previous word or a spaced hyphen/dash
before_word = '(?:' + eow + '|' + hyphen + ')'
# bracket text
in_bracket = '(?:' + word + '(?:' + before_word + word + ')*)'
bracket_round = '\\(' + in_bracket + '\\)'
bracket_square = '\\[' + in_bracket + '\\]'
bracket_curly = '\\{' + in_bracket + '\\}'
brackets = '(?:' + [bracket_round, bracket_square, bracket_curly].join('|') + ')'
# quoted text
in_quote = in_bracket + ",?"
quote_german = '„' + in_quote + '“'
quote_double = '"' + in_quote + '"'
quote_single = "'" + in_quote + "'"
quotes = '(?:' + [quote_german, quote_double, quote_single].join('|') + ')'
# sentence
sentence = cword + '(?:' + before_word + '(?:' + [word, brackets, quotes].join('|') + '))+' + eos
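
The script above only builds pattern strings; nothing is compiled or matched yet. The following is a minimal sketch of how such a string could be turned into a `Regexp` and used to test a whole string. It is not part of the original gist: the method names `german_sentence_regexp` and `german_sentence?` are hypothetical, and the pattern is a deliberately reduced subset (words separated by spaces or commas, no brackets, quotes or hyphens).

```ruby
# encoding: utf-8

# Hypothetical helper (not in the original gist): builds a reduced
# version of the sentence pattern and compiles it into a Regexp.
def german_sentence_regexp
  uchar = 'A-ZÄÖÜÁÀÉÈ'                  # uppercase characters
  lchar = 'a-zäöüßáàéè'                 # lowercase characters
  char  = uchar + lchar                 # all characters
  space = '(?:[ ]+)'                    # one or more spaces
  sep   = '(?:,?' + space + ')'         # assumption: simplified word separator
  cword = '[' + uchar + '][' + char + ']*'
  word  = '[' + char + ']+'
  eos   = '(?:[.?!…]+)'                 # end of sentence
  sentence = cword + '(?:' + sep + word + ')+' + eos
  # \A and \z anchor the match to the whole string.
  Regexp.new('\A' + sentence + '\z')
end

# Hypothetical helper: true if the whole text is one sentence
# according to the reduced pattern above.
def german_sentence?(text)
  german_sentence_regexp.match?(text)
end

puts german_sentence?('Das ist ein Satz.')
puts german_sentence?('das ist klein.')
```

The same `Regexp.new` call should work on the full `sentence` string built by the script; without the `\A` and `\z` anchors it can be passed to `String#scan` to pull candidate sentences out of a longer text.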