@Humoud
Last active April 1, 2024 03:48
Detecting Arabic characters with regex.

Detect all Arabic Characters:

/[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD]/
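As a rough illustration, the same ranges can be folded into one character class and used from Python. This is a minimal sketch, not part of the original gist: the names `ARABIC_RE`, `contains_arabic`, and `arabic_chars` are hypothetical, and it relies on Python's `re` module accepting `\uXXXX` escapes inside a pattern.

```python
import re

# Same ranges as the pattern above, merged into a single character class.
# Python's re module understands \uXXXX escapes inside the pattern itself.
ARABIC_RE = re.compile(
    r"[\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f"
    r"\ufd50-\ufd8f\ufd92-\ufdc7\ufe70-\ufefc\ufdf0-\ufdfd]"
)


def contains_arabic(text: str) -> bool:
    """Return True if at least one character falls in the ranges above."""
    return ARABIC_RE.search(text) is not None


def arabic_chars(text: str) -> list:
    """Return every matching character, in order of appearance."""
    return ARABIC_RE.findall(text)
```

Calling `contains_arabic` on a mixed string is enough for a yes/no check, while `arabic_chars` is handy when you want to see exactly which characters matched.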

Summary of the Arabic Unicode blocks (the regex above does not cover Arabic Extended-A or the two blocks outside the Basic Multilingual Plane):

  Arabic (0600-06FF, 225 characters)

  Arabic Supplement (0750-077F, 48 characters)

  Arabic Extended-A (08A0-08FF, 39 characters)

  Arabic Presentation Forms-A (FB50-FDFF, 608 characters)

  Arabic Presentation Forms-B (FE70-FEFF, 140 characters)

  Rumi Numeral Symbols (10E60-10E7F, 31 characters)

  Arabic Mathematical Alphabetic Symbols (1EE00-1EEFF, 143 characters)
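If you want to match whole blocks from this list, including those above U+FFFF, one option is to build the character class from a table of ranges. The sketch below is an assumption-laden example rather than the gist's method: `ARABIC_BLOCKS` and `build_block_pattern` are hypothetical names, and the pattern uses `\UXXXXXXXX` escapes, which Python's `re` module accepts. Whole-block ranges are broader than the regex above: they also match unassigned code points and noncharacters inside each block, and U+FEFF (the byte order mark) at the end of Presentation Forms-B.

```python
import re

# Block ranges copied from the summary list (start, end), inclusive.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF),    # Arabic
    (0x0750, 0x077F),    # Arabic Supplement
    (0x08A0, 0x08FF),    # Arabic Extended-A
    (0xFB50, 0xFDFF),    # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),    # Arabic Presentation Forms-B
    (0x10E60, 0x10E7F),  # Rumi Numeral Symbols
    (0x1EE00, 0x1EEFF),  # Arabic Mathematical Alphabetic Symbols
]


def build_block_pattern(blocks):
    """Compile one character class covering every (start, end) range."""
    # \UXXXXXXXX escapes work for any code point, including those above U+FFFF.
    parts = "".join(f"\\U{lo:08x}-\\U{hi:08x}" for lo, hi in blocks)
    return re.compile(f"[{parts}]")


ARABIC_BLOCK_RE = build_block_pattern(ARABIC_BLOCKS)
```

Keeping the ranges in a list makes it easy to drop a block (for example the presentation forms) or to tighten a range without editing a long pattern by hand.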

For more information, see the Wikipedia article on the Arabic Unicode block:

https://en.wikipedia.org/wiki/Arabic_(Unicode_block)

References:

http://stackoverflow.com/questions/11323596/regular-expression-for-arabic-language

@abousselmi

abousselmi commented Mar 29, 2020

Very useful, thanks!

I used it in a regex substitution to keep Arabic characters and digits and replace everything else with a space:

...
t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd92-\ufdc7\ufe70-\ufefc\uFDF0-\uFDFD]+', ' ', text)
...

@AhmedAbouelkher

Do you have an example in Go?
