@xeoncross
Last active September 27, 2022 09:28
Simple Go n-gram counter: counts bigrams, trigrams, or word sequences of any length from a given slice of strings.
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// SplitOnNonLetters splits a string on every run of non-letter runes,
// so punctuation, digits, and whitespace all act as separators.
func SplitOnNonLetters(s string) []string {
	notALetter := func(char rune) bool { return !unicode.IsLetter(char) }
	return strings.FieldsFunc(s, notALetter)
}

var str = "This is a 'sentence' about this thing I wrote. I wrote it yesterday."

func main() {
	// Lowercase first so "This" and "this" count as the same word.
	str = strings.ToLower(str)
	parts := SplitOnNonLetters(str)
	fmt.Printf("%+v\n", parts)
	fmt.Println(ngrams(parts, 2)) // bigrams
	fmt.Println(ngrams(parts, 3)) // trigrams
}

// ngrams counts every run of size consecutive words, joined by single spaces.
// It returns an empty map when size < 1 or size > len(words).
func ngrams(words []string, size int) map[string]uint32 {
	count := make(map[string]uint32)
	if size < 1 {
		return count
	}
	// Slide a window of size words across the slice.
	for i := 0; i+size <= len(words); i++ {
		count[strings.Join(words[i:i+size], " ")]++
	}
	return count
}
@xeoncross (Author):

If you want to compare sets of ngrams you can look at this: https://gist.github.com/miku/22a6a84a58db012817ac
