@santhoshtr
Created February 28, 2020 10:38
Malayalam corpus cleanup script
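# Miscellaneous cleanup rules for a Malayalam text corpus.
# The \xHH byte escapes and the -i option used here require GNU sed.
# Usage: sed -i -f corpora-cleanup.sed corpus/*.txt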
# Chillu normalization: consonant + virama + ZWJ -> atomic chillu letter
s/ന്‍/ൻ/g
s/ള്‍/ൾ/g
s/ല്‍/ൽ/g
s/ര്‍/ർ/g
s/ണ്‍/ൺ/g
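# Remove ZWNJ (U+200C) at end of line ($ anchors to the line, not to each word)
s/\xE2\x80\x8C$//
# Remove all remaining ZWJ (U+200D); the chillu rules above must run first
s/\xE2\x80\x8D//g
# Remove all soft hyphens (U+00AD)
s/\xC2\xAD//g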
# Replace the old au sign (U+0D4C) with the new one (U+0D57).
# Any ZWJ before it has already been removed by the rule above.
s/ൌ/ൗ/g
# Common mistakes
s/പക്ഷെ/പക്ഷേ/g
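# Remove ZWNJ (U+200C) after vowel signs
s/ു\xE2\x80\x8C/ു/g
s/ി\xE2\x80\x8C/ി/g
s/ോ\xE2\x80\x8C/ോ/g
s/ാ\xE2\x80\x8C/ാ/g
# Replace ഒ followed by the aa sign with the single letter ഓ
s/ഒാ/ഓ/g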
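# Chillu ൻ + റ -> ന്റ. The first rule applies anywhere; the others are
# anchored with $, which matches end of line, not the end of every word.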
s/ൻറെ/ന്റെ/g
s/ൻറ്$/ന്റ്/g
s/ൻറും$/ന്റും/g
s/ൻറിൽ$/ന്റിൽ/g
# ുൻപോൾ -> ുമ്പോൾ
s/ുൻപോൾ/ുമ്പോൾ/g