@feupeu
Created September 28, 2015 10:55
// Needs java.util.ArrayList and java.util.StringTokenizer
// Only index the page if it actually has a body
if (doc.body() != null) {
    // Extract all text on the page
    final String text = doc.body().text();

    // Split the text into whitespace-delimited tokens
    final ArrayList<String> tokens = new ArrayList<>();
    final StringTokenizer tokenizer = new StringTokenizer(text);

    // Keep every token that is not a stop word
    while (tokenizer.hasMoreTokens()) {
        final String token = tokenizer.nextToken();
        if (!Constants.stopWords.contains(token)) {
            // Store the stemmed form of the token
            tokens.add(Porter.stem(token));
        }
    }

    // Add every stemmed term to the index under this page's URL
    for (final String term : tokens) {
        Index.addTerm(term, url);
    }
    // System.out.println("Found " + tokens.size() + " terms.");
}
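
The snippet above depends on Constants.stopWords, Porter.stem and Index.addTerm, none of which are shown in this gist. The following is a minimal, self-contained sketch of the same tokenize, filter, index pipeline under stated assumptions: the class IndexSketch, the tiny STOP_WORDS set, the in-memory INDEX map and the normalize helper are all hypothetical stand-ins, and normalize only lowercases the token rather than performing real Porter stemming.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

class IndexSketch {
    // Hypothetical stand-in for Constants.stopWords (assumed to hold lowercase words)
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "and", "of", "is"));

    // Hypothetical stand-in for Index: maps each term to the URLs it occurs on
    static final Map<String, Set<String>> INDEX = new HashMap<>();

    // Placeholder for Porter.stem: this only lowercases, it does NOT stem
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Split on whitespace, normalize each token, and drop stop words
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String token = normalize(tokenizer.nextToken());
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Record that the given term occurs on the given URL
    static void addTerm(String term, String url) {
        INDEX.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
    }

    // Index every remaining term of a page's text under its URL
    static void indexPage(String text, String url) {
        for (String term : tokenize(text)) {
            addTerm(term, url);
        }
    }

    public static void main(String[] args) {
        indexPage("The quick fox", "http://example.com/a");
        indexPage("A quick dog", "http://example.com/b");
        // Prints the URLs of all pages containing "quick"
        System.out.println(INDEX.get("quick"));
    }
}
```

Unlike the original snippet, this sketch lowercases each token before the stop-word check, so "The" and "the" are treated alike. Whether the original needs that depends on how Constants.stopWords and Porter.stem handle case, which the gist does not show.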