@ravisorg
Last active November 6, 2023 08:18
Generate a PHP-compatible regular expression to match emoji from the most recent Unicode data.
<?php
/**
 * Uses the data from unicode.org's emoji-data.txt to build a PHP-compatible regular expression
 * along the lines of:
 * (?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\x{FE0F}?)
 *
 * To use: php build-hashtag-regexp.php <emoji-data.txt>
 * Output will be the generated regular expression, without delimiters. It contains \x{...}
 * escapes above 0xFF, so apply it with the u (UTF-8) modifier, e.g. '/' . $regexp . '/u'.
 *
 * Get a current copy of emoji-data.txt from
 * https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt
 * (the original location, http://www.unicode.org/Public/emoji/latest/emoji-data.txt, is
 * reported in the comments below to no longer carry the file).
 */
if (!isset($argv[1]) || !file_exists($argv[1])) {
    print "Usage: build-hashtag-regexp.php <emoji-data.txt>\n";
    print "\n";
    print "Prints the generated PHP-compatible regular expression to STDOUT.\n";
    print "\n";
    print "Grab emoji-data.txt from https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt\n";
    exit(1);
}

$emojiFilename = $argv[1];
$emojiFile = file($emojiFilename);

// Maps each property name (e.g. "Emoji_Modifier") to a list of regex fragments.
$emojiClasses = array();

foreach ($emojiFile as $line) {
    // Strip the trailing "# ..." comment, if any.
    $pos = strpos($line, '#');
    if ($pos !== false) {
        $line = substr($line, 0, $pos);
    }
    $line = trim($line);
    if ($line === '') {
        continue;
    }

    // Data lines have the form "<code point or range> ; <property>".
    $fields = explode(';', $line);
    if (count($fields) != 2) {
        continue;
    }
    $range = strtoupper(trim($fields[0]));
    $class = trim($fields[1]);
    if (!isset($emojiClasses[$class])) {
        $emojiClasses[$class] = array();
    }

    // A single code point becomes \x{XXXX}; "A..B" becomes [\x{A}-\x{B}].
    $range = explode('..', $range);
    if (count($range) == 1) {
        $emojiClasses[$class][] = '\\x{' . $range[0] . '}';
    } else {
        $emojiClasses[$class][] = '[\\x{' . $range[0] . '}-\\x{' . $range[1] . '}]';
    }
}

// Template expression; each \p{...} placeholder is replaced by an alternation
// built from the code points listed for that property in the data file.
$emojiRegexp = '(?:\\p{Emoji_Modifier_Base}\\p{Emoji_Modifier}?|\\p{Emoji_Presentation}|\\p{Emoji}\\x{FE0F}?)';
foreach ($emojiClasses as $class => $components) {
    $emojiRegexp = str_replace('\\p{' . $class . '}', '(?:' . implode('|', $components) . ')', $emojiRegexp);
}

print $emojiRegexp;
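
To make the per-line conversion in the script easier to follow, here is a minimal sketch of the same parsing step in isolation. The helper name `parseEmojiLine` is hypothetical and not part of the gist, and the sample lines only imitate the `emoji-data.txt` layout (code point or range, semicolon, property name, trailing `#` comment); their comment text is made up for illustration.

```php
<?php
// Hypothetical helper (not in the gist): converts one emoji-data.txt line
// into [property name, regex fragment], or null for blank/comment-only lines.
function parseEmojiLine(string $line): ?array
{
    // Drop the trailing "# ..." comment, if any.
    $pos = strpos($line, '#');
    if ($pos !== false) {
        $line = substr($line, 0, $pos);
    }

    // Expect exactly two fields: "<code point or range> ; <property>".
    $fields = explode(';', trim($line));
    if (count($fields) !== 2) {
        return null;
    }

    $range = explode('..', strtoupper(trim($fields[0])));
    $class = trim($fields[1]);

    // Single code point -> \x{XXXX}; range -> [\x{A}-\x{B}].
    $fragment = count($range) === 1
        ? '\\x{' . $range[0] . '}'
        : '[\\x{' . $range[0] . '}-\\x{' . $range[1] . '}]';

    return [$class, $fragment];
}

// Illustrative lines in the emoji-data.txt format (comment text is invented).
$samples = [
    '# comment-only line',
    '0030..0039    ; Emoji                # digits',
    '1F3FB..1F3FF  ; Emoji_Modifier       # skin tone modifiers',
    '00A9          ; Emoji                # copyright sign',
];

foreach ($samples as $sample) {
    $parsed = parseEmojiLine($sample);
    if ($parsed !== null) {
        print $parsed[0] . ' => ' . $parsed[1] . "\n";
    }
}
```

One thing the sample makes visible: the data file assigns the Emoji property to the ASCII digits (and, likewise, to `#` and `*`), so an expression built from `\p{Emoji}` will probably also match plain digits. If that is unwanted, filtering those code points out before building the alternation may be worth considering.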
@pscheit

pscheit commented May 3, 2020

Thanks so much! Note that this needs the u modifier attached.
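
A minimal sketch of what attaching the `u` modifier looks like in practice. The pattern below is a small hand-written stand-in, not the script's real output (which is far longer), and the sample text is made up. The script prints the expression without delimiters, so the `/` delimiters and the trailing `u` are added at the call site.

```php
<?php
// Stand-in for the generated expression (an assumption for illustration only):
// one emoticon range plus a heart with an optional variation selector.
$emojiRegexp = '(?:[\x{1F600}-\x{1F64F}]|\x{2764}\x{FE0F}?)';

// Sample UTF-8 text containing U+2764 U+FE0F and U+1F600.
$text = "I \u{2764}\u{FE0F} PHP \u{1F600}";

// The u modifier puts PCRE into UTF-8 mode, which is what allows
// \x{...} escapes with values above 0xFF.
$count = preg_match_all('/' . $emojiRegexp . '/u', $text, $matches);

print $count . "\n";
```

Without the `u` modifier, PCRE treats the pattern and subject as single bytes and rejects `\x{...}` values above 0xFF, so the call fails instead of matching.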

@techprogrammer

techprogrammer commented Feb 15, 2021

The last valid file is version 12.1 (10/2019):
https://www.unicode.org/Public/emoji/12.1/emoji-data.txt

The current Unicode directory no longer has this file:
http://www.unicode.org/Public/emoji/latest/emoji-data.txt

@e1himself

The latest released emoji-data.txt file I've found is for 15.0.0 (the latest released Unicode version as of now):

https://www.unicode.org/Public/15.0.0/ucd/emoji/emoji-data.txt

@marcelopm

It seems the latest emoji data has moved to https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt
