Created February 28, 2020 10:38
Malayalam corpus cleanup script
# Misc clean up on corpus
# sed -i -f corpora-cleanup.sed corpus/*.txt
# Chillu normalization (consonant + virama + ZWJ -> atomic chillu)
s/ന്\xE2\x80\x8D/ൻ/g
s/ള്\xE2\x80\x8D/ൾ/g
s/ല്\xE2\x80\x8D/ൽ/g
s/ര്\xE2\x80\x8D/ർ/g
s/ണ്\xE2\x80\x8D/ൺ/g
# Remove ZWNJ at end of words
s/\xE2\x80\x8C$//g
# Remove all other ZWJ
s/\xE2\x80\x8D//g
# Remove all soft hyphens
s/\xC2\xAD//g
# Replace old au sign with new one
s/ൌ/ൗ/g
# Common mistakes
s/പക്ഷെ/പക്ഷേ/g
# Remove ZWNJ before vowel signs
s/\xE2\x80\x8Cു/ു/g
s/\xE2\x80\x8Cി/ി/g
s/\xE2\x80\x8Cോ/ോ/g
s/\xE2\x80\x8Cാ/ാ/g
# Decomposed ഒ + ാ -> ഓ
s/ഒാ/ഓ/g
# ൻറെ -> ന്റെ at the end of words
s/ൻറെ/ന്റെ/g
s/ൻറ്$/ന്റ്/g
s/ൻറും$/ന്റും/g
s/ൻറിൽ$/ന്റിൽ/g
# ുൻപോൾ -> ുമ്പോൾ
s/ുൻപോൾ/ുമ്പോൾ/g
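
Because `sed -i` rewrites the corpus files in place, it may be worth checking a rule against a throwaway sample first. The sketch below is one possible way to do that, not part of the original script. It assumes the chillu rules are meant to match the legacy sequence consonant + virama + ZWJ, and it builds every character from explicit bytes so that the invisible ZWJ cannot be lost in copy and paste. The variable names (`zwj`, `legacy_n`, `atomic_n`, `rule`, `result`) are hypothetical.

```shell
#!/bin/sh
# Sketch: test one chillu rule on sample input before running sed -i.
# Assumption: the rule targets NA + VIRAMA + ZWJ (legacy chillu N).
set -eu

zwj=$(printf '\342\200\215')                          # U+200D ZERO WIDTH JOINER
legacy_n=$(printf '\340\264\250\340\265\215')"$zwj"   # U+0D28 U+0D4D + ZWJ
atomic_n=$(printf '\340\265\273')                     # U+0D7B CHILLU N

# Same shape as the first chillu rule, with literal bytes in the pattern.
rule="s/${legacy_n}/${atomic_n}/g"

# No -i here: output goes to stdout, so no file on disk is modified.
result=$(printf '%s\n' "$legacy_n" | LC_ALL=C sed "$rule")

if [ "$result" = "$atomic_n" ]; then
    echo "chillu rule OK"
else
    echo "chillu rule FAILED" >&2
    exit 1
fi
```

The same pattern extends to the other rules: pipe a known input through `sed -f corpora-cleanup.sed` without `-i` and compare the output with the expected string before touching `corpus/*.txt`.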