@iammapping
Last active May 15, 2018 04:13
tokenize with substring
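<?php
/**
 * Split a string into every substring of at least $minLength characters.
 * Tokens that touch the start of the string get a leading '[' and tokens
 * that touch the end get a trailing ']'.
 *
 * @param string $str       Input string, measured in characters via the mb_* functions.
 * @param int    $minLength Minimum token length in characters.
 * @return string[] Tokens ordered by length, then by start offset.
 */
function tokenize($str, $minLength = 3) {
    $strLength = mb_strlen($str);

    // Strings no longer than $minLength are emitted whole, in all three marked forms.
    if ($strLength <= $minLength) {
        return [
            "[{$str}",
            "{$str}]",
            "[{$str}]"
        ];
    }

    $subLength = $minLength;
    $tokens = [];

    // Grow the window from $minLength up to the full string length.
    while ($subLength <= $strLength) {
        $start = 0;

        // Slide the window across the string one character at a time.
        while (($start + $subLength) <= $strLength) {
            $token = mb_substr($str, $start, $subLength);

            if ($str !== $token) {
                if ($start === 0) {
                    // Window begins at the start of the string.
                    $tokens[] = '[' . $token;
                } elseif (($start + $subLength) === $strLength) {
                    // Window ends at the end of the string.
                    $tokens[] = $token . ']';
                } else {
                    // Window lies strictly inside the string.
                    $tokens[] = $token;
                }
            } else {
                // The window covers the whole string: emit all three marked forms.
                $tokens[] = "[{$token}";
                $tokens[] = "{$token}]";
                $tokens[] = "[{$token}]";
            }

            $start++;
        }

        $subLength++;
    }

    return $tokens;
}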
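
// Usage sketch (not part of the original gist). Reading '[' and ']' as
// start-of-string and end-of-string markers is an assumption drawn from the
// code above, and $demoTokens is a hypothetical variable name.
//
// With the default $minLength of 3, tokenize('abcd') returns
//   ['[abc', 'bcd]', '[abcd', 'abcd]', '[abcd]']
// and a string no longer than $minLength, such as 'ab', returns
//   ['[ab', 'ab]', '[ab]']
//
// An interior substring only appears unmarked when it touches neither end:
// tokenize('abcde') contains 'bcd', but 'abc' appears only as '[abc'.
$demoTokens = tokenize('abcd');
print_r($demoTokens);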