@tarekeldeeb
Last active June 21, 2022 17:05
Bash Script to Remove Arabic Diacritics from UTF-8 or Windows-1256 / iso-8859-1 Encoded Text
# Bash function to remove Arabic diacritics (tashkeel) from UTF-8 or Windows-1256 / iso-8859-1 encoded text
# - Converts Arabic commas and semicolons to a Latin comma (UTF-8 input; legacy-encoded input gets a space instead)
# - Removes diacritic marks
# - Collapses runs of spaces into a single space
# - Replaces Alif-with-hamza and Alif-with-madda with a bare Alif
#
# Example: removeArabicDialects my_utf8.txt > clear.txt
# Install: Copy this gist into your ~/.bashrc
# Author: Tarek Eldeeb
#
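removeArabicDialects () {
  # file(1) reports the charset; any "utf" match selects the UTF-8 branch.
  if file -bi "$1" | grep -q utf ; then
    # Arabic comma/semicolon -> ",", drop U+064B..U+065E, squeeze spaces,
    # then map Alif with madda/hamza to a bare Alif. '\u' escapes need bash 4.2+.
    sed "s/[$(echo -ne '\u060C\u061B')]/,/g" "$1" | \
      sed "s/[$(echo -ne '\u064B-\u065E')]//g" | \
      sed "s/ \+/ /g" | \
      sed "s/[$(echo -ne '\u0622\u0623\u0625')]/$(echo -ne '\u0627')/g"
  else
    # Legacy single-byte input (Windows-1256 byte values): punctuation and tabs
    # become spaces, diacritic bytes are deleted, Alif variants become 0xC7.
    tr $'\xA1\xBA.,:\t' ' ' < "$1" | \
      tr -d '\356-\377\327\334\340\342\347-\353' | \
      sed "s/ \+/ /g" | \
      tr $'\xc5\xc2\xc3' $'\xc7'
  fi
}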
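The UTF-8 branch above depends on sed running in a UTF-8 locale, so that each `\u` character counts as a single bracket-expression member. As a rough sketch, the same four steps can be written against raw UTF-8 bytes instead, which works regardless of the caller's locale. The name `strip_tashkeel_bytes` is hypothetical and not part of the gist. The sketch assumes GNU sed (for the `\xHH` escapes) and reads from stdin rather than a file argument.

```shell
# Hypothetical helper, not part of the original gist.
# Assumes GNU sed: \xHH escapes are a GNU extension.
# LC_ALL=C makes sed treat the input as raw bytes, so no UTF-8 locale is needed.
strip_tashkeel_bytes () {
  # 1. U+060C / U+061B (bytes D8 8C / D8 9B) -> Latin comma
  # 2. U+064B..U+065E (bytes D9 8B .. D9 9E) -> deleted
  # 3. runs of spaces -> one space
  # 4. U+0622 / U+0623 / U+0625 (D8 A2 / A3 / A5) -> U+0627 (D8 A7)
  LC_ALL=C sed \
    -e 's/\xd8[\x8c\x9b]/,/g' \
    -e 's/\xd9[\x8b-\x9e]//g' \
    -e 's/  */ /g' \
    -e 's/\xd8[\xa2\xa3\xa5]/\xd8\xa7/g'
}

# Usage: filter a vocalized string through the helper.
printf '%s\n' 'أَهْلًا،  بِكَ' | strip_tashkeel_bytes
# prints: اهلا, بك
```

Matching on bytes is safe here because 0xD8 and 0xD9 only ever occur as UTF-8 lead bytes, so a pattern cannot start in the middle of another character.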