@Slater-Victoroff
Created March 25, 2013 08:54
Cleaning strings into tokens with stemming and URL removal
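import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
// SnowballStemmer is assumed to come from the Snowball Java distribution
// (e.g. org.tartarus.snowball.SnowballStemmer); adjust the import to your setup.

public Set<String> parseRawString(String rawString, SnowballStemmer stemmer) {
    Set<String> answer = new HashSet<String>();
    // Split on any whitespace (spaces included) so each word or URL is checked on its own.
    String[] firstSplit = rawString.split("\\s+");
    List<String> rawSplit = new ArrayList<String>();
    for (String s : firstSplit) {
        try {
            // Pieces that parse as a URL are dropped entirely.
            new URL(s);
        } catch (MalformedURLException e) {
            // Not a URL: break the piece apart on punctuation.
            rawSplit.addAll(Arrays.asList(s.split("[\\p{P}]")));
        }
    }
    for (String s : rawSplit) {
        stemmer.setCurrent(s.toLowerCase());
        stemmer.stem();
        String addition = stemmer.getCurrent();
        // Keep only stems longer than one character (this also drops empty strings).
        if (addition.length() > 1) {
            answer.add(addition);
        }
    }
    return answer;
}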
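A usage sketch, for trying the idea without the Snowball jar on the classpath. It is a variant of the method above, not the original: it assumes the input is split on all whitespace, and it takes the stemmer as a plain function. `TokenCleanerDemo`, `clean`, and `toyStem` are hypothetical names, and `toyStem` only strips a trailing "s", so its output will differ from what a real Snowball stemmer produces.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

public class TokenCleanerDemo {

    // Drop URLs, split the rest on punctuation, lowercase, stem, and keep
    // stems longer than one character. The stemmer is passed in as a function
    // (an assumption made here so the example runs on its own).
    static Set<String> clean(String raw, UnaryOperator<String> stem) {
        Set<String> tokens = new TreeSet<>(); // sorted, for readable output
        for (String piece : raw.split("\\s+")) {
            try {
                new URL(piece); // parses as a URL: skip it
            } catch (MalformedURLException e) {
                for (String word : piece.split("\\p{P}")) {
                    String stemmed = stem.apply(word.toLowerCase());
                    if (stemmed.length() > 1) {
                        tokens.add(stemmed);
                    }
                }
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for a real stemmer: strips one trailing "s".
    static String toyStem(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    public static void main(String[] args) {
        String text = "Cats chase dogs, see http://example.com/pets for details!";
        // prints [cat, chase, detail, dog, for, see]
        System.out.println(clean(text, TokenCleanerDemo::toyStem));
    }
}
```

Note that `new URL(...)` only accepts strings with a known protocol, so a bare `example.com` is treated as ordinary text and split on its dot.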