@Kroc
Created August 31, 2012 10:41
Kroc's Transliteration Method
//safeTransliterate v3, copyright (cc-by 3.0) Kroc Camen <camendesign.com>
//generate a safe (a-z0-9_) string, for use as filenames or URLs, from an arbitrary string
function safeTransliterate ($text) {
//if available, this function uses PHP 5.4's transliterator_transliterate (intl extension), which is capable of
//converting Arabic, Hebrew, Greek, Chinese, Japanese and more into ASCII! however, we use our manual (and crude)
//fallback *first* because
//we will take the liberty of transliterating some things into more readable ASCII-friendly forms,
//e.g. "100℃" > "100degc" instead of "100oc"
/* manual transliteration list:
-------------------------------------------------------------------------------------------------------------- */
/* this list is supposed to be practical, not comprehensive, representing:
1. the most common accents and special letters that get typed, and
2. the most practical transliterations for readability;
given that I know nothing of other languages, I will need your assistance to improve this list,
mail kroc@camendesign.com with help and suggestions.
this data was produced with the help of:
http://www.unicode.org/charts/normalization/
http://www.yuiblog.com/sandbox/yui/3.3.0pr3/api/text-data-accentfold.js.html
http://www.utf8-chartable.de/
*/
static $translit = array (
'a' => '/[ÀÁÂẦẤẪẨÃĀĂẰẮẴȦẲǠẢÅÅǺǍȀȂẠẬẶḀĄẚàáâầấẫẩãāăằắẵẳȧǡảåǻǎȁȃạậặḁą]/u',
'b' => '/[ḂḄḆḃḅḇ]/u', 'c' => '/[ÇĆĈĊČḈçćĉċčḉ]/u',
'd' => '/[ÐĎḊḌḎḐḒďḋḍḏḑḓð]/u',
'e' => '/[ÈËĒĔĖĘĚȄȆȨḔḖḘḚḜẸẺẼẾỀỂỄỆèëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]/u',
'f' => '/[Ḟḟ]/u', 'g' => '/[ĜĞĠĢǦǴḠĝğġģǧǵḡ]/u',
'h' => '/[ĤȞḢḤḦḨḪĥȟḣḥḧḩḫẖ]/u', 'i' => '/[ÌÏĨĪĬĮİǏȈȊḬḮỈỊiìïĩīĭįǐȉȋḭḯỉị]/u',
'j' => '/[Ĵĵǰ]/u', 'k' => '/[ĶǨḰḲḴKķǩḱḳḵ]/u',
'l' => '/[ĹĻĽĿḶḸḺḼĺļľŀḷḹḻḽ]/u', 'm' => '/[ḾṀṂḿṁṃ]/u',
'n' => '/[ÑŃŅŇǸṄṆṈṊñńņňǹṅṇṉṋ]/u',
'o' => '/[ÒÖŌŎŐƠǑǪǬȌȎȪȬȮȰṌṎṐṒỌỎỐỒỔỖỘỚỜỞỠỢØǾòöōŏőơǒǫǭȍȏȫȭȯȱṍṏṑṓọỏốồổỗộớờởỡợøǿ]/u',
'p' => '/[ṔṖṕṗ]/u', 'r' => '/[ŔŖŘȐȒṘṚṜṞŕŗřȑȓṙṛṝṟ]/u',
's' => '/[ŚŜŞŠȘṠṢṤṦṨſśŝşšșṡṣṥṧṩ]/u', 'ss' => '/[ß]/u',
't' => '/[ŢŤȚṪṬṮṰţťțṫṭṯṱẗ]/u', 'th' => '/[Þþ]/u',
'u' => '/[ÙŨŪŬŮŰŲƯǓȔȖṲṴṶṸṺỤỦỨỪỬỮỰùũūŭůűųưǔȕȗṳṵṷṹṻụủứừửữựµ]/u',
'v' => '/[ṼṾṽṿ]/u', 'w' => '/[ŴẀẂẄẆẈŵẁẃẅẇẉẘ]/u',
'x' => '/[ẊẌẋẍ×]/u', 'y' => '/[ÝŶŸȲẎỲỴỶỸýÿŷȳẏẙỳỵỷỹ]/u',
'z' => '/[ŹŻŽẐẒẔźżžẑẓẕ]/u',
//combined letters and ligatures:
'ae' => '/[ÄǞÆǼǢäǟæǽǣ]/u', 'oe' => '/[Œœ]/u',
'dz' => '/[ǄǅǱǲǆǳ]/u',
'ff' => '/[ﬀ]/u', 'fi' => '/[ﬃﬁ]/u', 'ffl' => '/[ﬄﬂ]/u',
'ij' => '/[Ĳĳ]/u', 'lj' => '/[Ǉǈǉ]/u', 'nj' => '/[Ǌǋǌ]/u',
'st' => '/[ﬅﬆ]/u', 'ue' => '/[ÜǕǗǙǛüǖǘǚǜ]/u',
//currencies:
'eur' => '/[€]/u', 'cents' => '/[¢]/u', 'lira' => '/[₤]/u', 'dollars' => '/[$]/u',
'won' => '/[₩]/u', 'rs' => '/[₨]/u', 'yen' => '/[¥]/u', 'pounds' => '/[£]/u',
'pts' => '/[₧]/u',
//misc:
'degc' => '/[℃]/u', 'degf' => '/[℉]/u',
'no' => '/[№]/u', 'tm' => '/[™]/u'
);
//do the manual transliteration first
$text = preg_replace (array_values ($translit), array_keys ($translit), $text);
//flatten the text down to just a-z0-9, with underscores instead of spaces
//(note: the second pattern would allow a dash, but \p{P} strips all punctuation first, dashes included)
$text = preg_replace (
//remove punctuation //replace non a-z //deduplicate //trim underscores from start & end
array ('/\p{P}/u', '/[^_a-z0-9-]/i', '/_{2,}/', '/^_|_$/'),
array ('', '_', '_', ''),
//attempt transliteration with PHP5.4's transliteration engine (best):
//(this method can handle near anything, including converting chinese and arabic letters to ASCII.
// requires the 'intl' extension to be enabled)
function_exists ('transliterator_transliterate') ? transliterator_transliterate (
//split unicode accents and symbols, e.g. "Å" > "A°":
'NFKD; '.
//convert everything to the Latin charset e.g. "ま" > "ma":
//(splitting the unicode before transliterating catches some complex cases,
// such as: "㏳" >NFKD> "20日" >Latin> "20ri")
'Latin; '.
//because the Latin unicode table still contains a large number of non-pure-A-Z glyphs (e.g. "œ"),
//convert what remains to an even stricter set of characters, the US-ASCII set:
//(we must do this because "Latin/US-ASCII" alone is not able to transliterate non-Latin characters
// such as "ま". this two-stage method also means we catch awkward characters such as:
// "㏀" >Latin> "kΩ" >Latin/US-ASCII> "kO")
'Latin/US-ASCII; '.
//remove the now stand-alone diacritics from the string
'[:Nonspacing Mark:] Remove; '.
//change everything to lowercase; anything non A-Z 0-9 that remains will be removed by
//the letter stripping above
'Lower',
$text)
//attempt transliteration with iconv: <php.net/manual/en/function.iconv.php>
: strtolower (function_exists ('iconv') ? str_replace (array ("'", '"', '`', '^', '~'), '', strtolower (
//note: results of this are different depending on iconv version,
// sometimes the diacritics are written to the side e.g. "ñ" = "~n", which are removed
iconv ('UTF-8', 'US-ASCII//IGNORE//TRANSLIT', $text)
)) : $text)
);
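//old iconv versions and certain inputs may produce an empty string (or null, if preg_replace fails).
//don't allow a blank response. compare strictly: a loose `!$text` would also turn the valid result "0" into "_"
return ($text === null || $text === '') ? '_' : $text;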
}
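
The flattening step is easiest to follow in isolation. The following is a minimal sketch, not part of the original gist: `flattenToSafe` is a hypothetical name, and the function assumes its input has already been transliterated to ASCII. It reuses the four patterns from the `preg_replace` call above, applied in the same order.

```php
<?php
//hypothetical helper (not in the original gist): the flattening step of safeTransliterate on its own.
//assumes $ascii is already plain ASCII, i.e. transliteration has been done elsewhere
function flattenToSafe (string $ascii): string {
	$flat = preg_replace (
		//remove punctuation  //replace non a-z  //deduplicate  //trim underscores from start & end
		array ('/\p{P}/u', '/[^_a-z0-9-]/i', '/_{2,}/', '/^_|_$/'),
		array ('',         '_',              '_',       ''),
		strtolower ($ascii)
	);
	//preg_replace returns null on error; never hand back a blank result
	return ($flat === null || $flat === '') ? '_' : $flat;
}

//flattenToSafe ('Hello, World!')    → "hello_world"
//flattenToSafe ('My File (v2).txt') → "my_file_v2txt"
//flattenToSafe ('foo-bar')          → "foobar"
//flattenToSafe ('!!!')              → "_"
```

Because `\p{P}` matches every Unicode punctuation character, hyphens and the dot of a file extension are removed before the other patterns run, so an extension has to be split off and re-attached by the caller if it should survive.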
@kurtuluso

Everything seems OK, but Arabic characters are absent.

@naveed81

naveed81 commented Nov 21, 2017

Hi, this was nice. Can you help me transliterate Gujarati (an Indian language) to English? I used the transliterator_transliterate function, but the output is still not clean. I need your suggestions.

UPDATE:
It's working for Gujarati. This is one of the most elegant pieces of code I have seen in a while. Hats off to you.

@pjdevries

Thank you for this nice piece of code.

Could you make the "safe" part optional, i.e. skip the flattening and return the manually transliterated text in its original case?
