Last active
January 3, 2016 01:59
Snippet from Tomlin ngram project.
<?php
// For this example, $ngram = 1, meaning we're only looking at one word at a time and not phrases
$textArray = explode( " ", $formatted ); // $formatted is the text sample with most punctuation removed
if ( !isset( $ngramArray ) ) {
    $ngramArray = array(); // tally of chunk => number of occurrences
}
// Use <= so the final n-gram in the sample is counted too
for ( $i = 0; $i <= count( $textArray ) - $ngram; $i++ ) {
    $chunk = "";
    for ( $j = 0; $j < $ngram; $j++ ) {
        $chunk .= $textArray[ $i + $j ] . " "; // keep adding words to chunk until ngram length is reached
    }
    $chunk = trim( $chunk ); // get rid of extra space at the end of chunk
    if ( !isset( $ngramArray[ $chunk ] ) ) {
        $ngramArray[ $chunk ] = 1; // first time this chunk has been seen
    } else {
        $ngramArray[ $chunk ]++;
    }
}
?>
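To make the counting easier to try on its own, here is a minimal sketch that wraps the same idea in a function and runs it on a made-up sentence. The function name `countNgrams`, the sample text, and the use of `preg_split` to collapse repeated whitespace are assumptions for illustration and are not part of the original project. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the original project): counts n-grams in a string.
function countNgrams( string $formatted, int $ngram = 1 ): array {
    $ngramArray = array();
    if ( $ngram < 1 ) {
        return $ngramArray; // nothing sensible to count
    }
    // Assumption: split on any run of whitespace and drop empty tokens,
    // rather than splitting on single spaces with explode().
    $textArray = preg_split( '/\s+/', $formatted, -1, PREG_SPLIT_NO_EMPTY );
    for ( $i = 0; $i <= count( $textArray ) - $ngram; $i++ ) {
        // Join $ngram consecutive words into one space-separated chunk
        $chunk = implode( " ", array_slice( $textArray, $i, $ngram ) );
        // Start at 0 when the chunk has not been seen before, then add one
        $ngramArray[ $chunk ] = ( $ngramArray[ $chunk ] ?? 0 ) + 1;
    }
    return $ngramArray;
}

// Made-up sample text with punctuation already removed
$bigrams = countNgrams( "the cat saw the cat", 2 );
print_r( $bigrams );
```

With `$ngram = 2` the sample yields three distinct chunks, and "the cat" is counted twice. Passing `1` instead counts single words, which matches the example setting described in the snippet's first comment.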