@sorenmacbeth
Created October 10, 2011 04:04
;; Assumed imports (not shown in the gist):
;;   edu.stanford.nlp.tagger.maxent.MaxentTagger
;;   org.apache.lucene.analysis.tokenattributes.TermAttribute  (Lucene 3.x)
;; EnglishLemmaAnalyzer appears to be the POS-tagging/lemmatizing analyzer
;; from the lucene-stanford-lemmatizer library.
(defn lemmatize-text
  "Apply a Lucene tokenizer to cleaned text content, returning the
  lemmatized terms as a lazy seq."
  [page-text]
  (let [reader    (java.io.StringReader. page-text)
        ;; Copy the tagger model out of the jar so MaxentTagger can load it
        ;; from a filesystem path, then wrap it in the lemmatizing analyzer.
        analyzer  (-> (resource-to-temp-file
                        "stanford_nlp_models/bidirectional-distsim-wsj-0-18.tagger"
                        ".tagger")
                      (.getAbsolutePath)
                      (MaxentTagger.)
                      (EnglishLemmaAnalyzer.))
        tokenizer (.tokenStream analyzer nil reader)
        term-att  (.addAttribute tokenizer TermAttribute)]
    (tokenizer-seq tokenizer term-att)))
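
;; The gist calls two helpers it does not define: tokenizer-seq and
;; resource-to-temp-file. Minimal sketches of what they likely look like
;; follow; both are assumptions, not part of the original gist, written
;; against the Lucene 3.x TokenStream/TermAttribute API.

(defn tokenizer-seq
  "Hypothetical sketch: lazily walk a Lucene TokenStream, yielding the
  text of each token via the shared TermAttribute."
  [tokenizer term-att]
  (lazy-seq
    (when (.incrementToken tokenizer)
      (cons (.term term-att)
            (tokenizer-seq tokenizer term-att)))))

(defn resource-to-temp-file
  "Hypothetical sketch: copy a classpath resource to a temp file so a
  path-based API such as MaxentTagger can read it."
  [resource-path suffix]
  (let [tmp (java.io.File/createTempFile "model" suffix)]
    (.deleteOnExit tmp)
    (with-open [in (.openStream (clojure.java.io/resource resource-path))]
      (clojure.java.io/copy in tmp))
    tmp))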