@mutekinootoko
Last active October 31, 2016 07:52
n-gram tokenizer lucene 6
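import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Collect every character n-gram of length 2 to 5 from the file named by args[0].
ArrayList<String> grams = new ArrayList<>();
System.out.println("loading content file: " + args[0]);
// try-with-resources closes the tokenizer (and the reader it wraps) even if reading fails.
try (NGramTokenizer nGramTokenizer = new NGramTokenizer(2, 5)) {
    nGramTokenizer.setReader(new BufferedReader(new FileReader(args[0])));
    CharTermAttribute charTermAttribute = nGramTokenizer.addAttribute(CharTermAttribute.class);
    nGramTokenizer.reset(); // must be called before the first incrementToken()
    while (nGramTokenizer.incrementToken()) {
        grams.add(charTermAttribute.toString());
    }
    nGramTokenizer.end(); // signals end of stream after the last token
}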
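
To make it easier to see what ends up in `grams`, here is a minimal plain-Java sketch of character n-gram generation that does not depend on Lucene. The class `NGramSketch` and the method `charNGrams` are hypothetical names introduced only for illustration. The sketch works on Java `char` units and emits grams by start position, then by length. My understanding is that Lucene's `NGramTokenizer` orders its output the same way but counts Unicode code points, so treat any match with Lucene's output as an assumption rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
class NGramSketch {
    /** Returns every substring of text whose length is between minGram and maxGram. */
    static List<String> charNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // Walk each start position, then grow the gram from minGram up to maxGram.
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [ab, abc, bc, bcd, cd]
        System.out.println(charNGrams("abcd", 2, 3));
    }
}
```

With the gist's settings, the equivalent call would be `charNGrams(text, 2, 5)`. The list grows roughly with input length times the number of gram sizes, so a large file can produce a very large `grams` list.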