@karimkhanp
Last active December 14, 2018 13:11
How to fix hex-escaped (mojibake) text in Python when dealing with non-English text. This problem usually occurs with non-English news, tweets, or similar data, where UTF-8 bytes were decoded as Latin-1 and the string shows up as runs of \xNN escapes.
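To make the cause concrete, here is a minimal Python 3 sketch of how such a string can arise. It assumes the damage comes from decoding UTF-8 bytes as Latin-1, which is what the fix in this gist reverses. The variable names are illustrative only.

```python
# The Arabic word "bin" (U+0628 U+0646), written with escapes for clarity.
original = '\u0628\u0646'

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1).
# Each UTF-8 byte becomes one character in the U+0080-U+00FF range.
garbled = original.encode('utf8').decode('latin1')

# ascii() shows the hex escapes that appear in the garbled string.
print(ascii(garbled))  # '\xd8\xa8\xd9\x86'
```

The same four escapes appear in the sample string used in this gist.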
a = u'\xd8\xad\xd9\x83\xd9\x88\xd9\x85\xd8\xa9 \xd9\x85\xd8\xad\xd9\x85\xd8\xaf \xd8\xa8\xd9\x86 \xd8\xb3\xd9\x84\xd9\x85\xd8\xa7\xd9\x86 \xd8\xa3\xd9\x86\xd9\x81\xd9\x82\xd8\xaa \xd9\x85\xd9\x84\xd9\x8a\xd8\xa7\xd8\xb1\xd8\xa7\xd8\xaa \xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd8\xa7\xd8\xb1\xd8\xa7\xd8\xaa \xd9\x84\xd8\xaf\xd8\xb9\xd9\x85 \xd8\xb3\xd9\x88\xd9\x82 \xd8\xa7\xd9\x84\xd8\xa3\xd8\xb3\xd9\x87\xd9\x85 \xd8\xa7\xd9\x84\xd9\x85\xd8\xad\xd9\x84\xd9\x8a\xd8\xa9 \xd9\x88\xd9\x85\xd9\x88\xd8\xa7\xd8\xac\xd9\x87\xd8\xa9 \xd9\x85\xd9\x88\xd8\xac\xd8\xa7\xd8\xaa \xd8\xa7\xd9\x84\xd8\xa8\xd9\x8a\xd8\xb9 \xd8\xa8\xd8\xb9\xd8\xaf \xd9\x85\xd9\x82\xd8\xaa\xd9\x84\xe2\x80\xa64'
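import re

def convert(match):
    # Undo the mis-decoding: map the characters back to bytes, then decode as UTF-8.
    try:
        return match.group(0).encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # This run is not valid UTF-8, so leave it unchanged.
        return match.group(0)

# Repair every run of characters in the U+0080-U+00FF range.
a = re.sub(r'[\x80-\xFF]+', convert, a)
print(a)  # Python 3; on Python 2 use: print a.encode('utf8')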
حكومة محمد بن سلمان أنفقت مليارات الدولارات لدعم سوق الأسهم المحلية ومواجهة موجات البيع بعد مقتل…4
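The same approach can be wrapped in a reusable helper. This is a sketch for Python 3, not part of the original gist. The name `fix_mojibake` is hypothetical, and it assumes the text was mis-decoded as Latin-1 specifically (text mis-decoded as Windows-1252 would need a different codec).

```python
import re

def fix_mojibake(text):
    """Re-decode runs of U+0080-U+00FF characters as UTF-8 where possible."""
    def convert(match):
        run = match.group(0)
        try:
            # Characters in this range always encode to Latin-1 bytes;
            # only the UTF-8 decode can fail.
            return run.encode('latin1').decode('utf8')
        except UnicodeDecodeError:
            return run  # genuine Latin-1 text, keep it as-is
    return re.sub(r'[\x80-\xff]+', convert, text)

broken = 'caf\xc3\xa9 \xd8\xa8\xd9\x86'
fixed = fix_mojibake(broken)
print(fixed)
```

Because each run is decoded on its own and left alone on failure, a genuine Latin-1 character such as the `\xef` in `'na\xefve'` passes through unchanged.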