@ahmednasir91
Created May 1, 2014 13:14
<collation name="utf8_arabic" id="100">
  <rules>
    <reset>\u0627</reset>
    <i>\u0622</i>
    <i>\u0623</i>
    <i>\u0625</i>
  </rules>
  <rules>
    <reset>\u0647</reset>
    <i>\u0629</i>
  </rules>
  <rules>
    <reset>\u0649</reset>
    <i>\u064a</i>
  </rules>
</collation>
@mzeidhassan

as-salamo Alaikom Ahmed,

I have been following your question at https://stackoverflow.com/questions/23272518/normalize-arabic-text-mysql, but I don't know how to use this index file in MySQL. Can you please provide instructions?

Also, can I use the same approach to ignore Arabic diacritics (tashkeel) during search?

Thanks in advance for your help!

Mohamed

@ahm-essam

ahm-essam commented Aug 15, 2017

@mzeidhassan most likely you worked things out months ago; I am just adding the answer to your question in case someone comes here looking for a solution.

You need to add the collation to a file called 'Index.xml'. Its location varies from one system to another; the file lives in the directory that MySQL reports in the 'character_sets_dir' variable, which you can look up with the following statement:

SHOW VARIABLES LIKE 'character_sets_dir';

Back up the file, scroll to the <charset name="utf8"> element, and add the collation there. It will look like the following:

<charset name="utf8">
.
.
.
  <collation name="utf8_arabic_ci" id="1029">
   <rules>
     <reset>\u0627</reset> <!-- Alef 'ا' -->
     <i>\u0623</i>        <!-- Alef With Hamza Above 'أ' -->
     <i>\u0625</i>        <!-- Alef With Hamza Below 'إ' -->
     <i>\u0622</i>        <!-- Alef With Madda Above 'آ' -->
   </rules>
   <rules>
     <reset>\u0629</reset> <!-- Teh Marbuta 'ة' -->
     <i>\u0647</i>        <!-- Heh 'ه' -->
   </rules>
   <rules>
     <reset>\u0000</reset> <!-- Ignore Tashkil -->
     <i>\u064E</i>        <!-- Fatha 'َ' -->
     <i>\u064F</i>        <!-- Damma 'ُ' -->
     <i>\u0650</i>        <!-- Kasra 'ِ' -->
     <i>\u0651</i>        <!-- Shadda 'ّ' -->
     <i>\u0652</i>        <!-- Sukun 'ْ' -->
     <i>\u064B</i>        <!-- Fathatan 'ً' -->
     <i>\u064C</i>        <!-- Dammatan 'ٌ' -->
     <i>\u064D</i>        <!-- Kasratan 'ٍ' -->
   </rules>
  </collation>
</charset>

My collation, named 'utf8_arabic_ci' here, is based on Ahmed Nasir's; I just added the part that ignores tashkil. You will have to restart MySQL and then change the collation of the column with a statement like:

ALTER TABLE persons MODIFY name VARCHAR(50) CHARACTER SET 'utf8' COLLATE 'utf8_arabic_ci';
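
Once the server is back up, a few quick checks can show whether the collation was loaded and whether the rules behave as intended. This is only a sketch: it assumes the collation name 'utf8_arabic_ci' and the 'persons' table from the statement above, and a server version where the character set is still called 'utf8'. The sample words are arbitrary test strings.

```sql
-- Should list one row if the server read the collation from Index.xml.
SHOW COLLATION LIKE 'utf8_arabic_ci';

-- Alef variants: 'أ' against 'ا'. A result of 1 means they compare equal.
SELECT _utf8'أحمد' = _utf8'احمد' COLLATE utf8_arabic_ci AS alef_equal;

-- Tashkil: a vocalized word against its bare form.
SELECT _utf8'مُحَمَّد' = _utf8'محمد' COLLATE utf8_arabic_ci AS tashkil_ignored;

-- After the ALTER TABLE, plain comparisons on the column use its collation.
SELECT name FROM persons WHERE name = 'احمد';
```

If the first statement returns no rows, the server did not pick up the new entry; the MySQL error log usually reports problems it found while parsing Index.xml. If either comparison returns 0, the corresponding rule is not taking effect and is worth rechecking.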

For more information, see my blog post on this subject, arabic-case-insensitive-in-database-systems, and the MySQL documentation about adding a new collation.

@hzakher

hzakher commented Jun 2, 2018

Thanks for this gist.
Maybe you should add a rule for Ya' as well, so that ي and ى compare as equal:

   <rules>
     <reset>\u064A</reset> <!-- Yeh 'ي' -->
     <i>\u0649</i>        <!-- Alef Maksura 'ى' -->
   </rules>
