@damionjunk
Created December 13, 2012 01:30
A quick example of Clojure / StanfordNLP CoNLL parsing.
(ns wujuko.nlp.fmt-read
  (:require [clojure.java.io :as io])
  (:import [java.util Properties]
           [edu.stanford.nlp.sequences
            SeqClassifierFlags
            CoNLLDocumentReaderAndWriter
            ColumnDocumentReaderAndWriter]
           [edu.stanford.nlp.ling CoreLabel]))
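
(defn corelabel->map
  "Converts a CoreLabel into a map of its word and tag."
  [^CoreLabel cl]
  {:token (.word cl) :tag (.tag cl)})

(defn words
  [ms] (map :token ms))

(defn pos
  [ms] (map :tag ms))

;; corelabel->map does not set :ner, so this yields nils for its maps.
(defn ner
  [ms] (map :ner ms))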
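
(defn conll-seq
  "Returns a lazy sequence of documents read from reader, each a
  java.util.List of CoreLabel. Consume it while reader is still open."
  [^java.io.Reader reader]
  (let [cdr (ColumnDocumentReaderAndWriter.)
        _   (.init cdr (SeqClassifierFlags.))
        ci  (.getIterator cdr reader)]
    (iterator-seq ci)))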

(defn conll-map-seq
  "Returns a lazy sequence with one element per document, each a
  sequence of maps in the form:
  ( {:token \"Hi\" :tag \"O\"}
    {:token \"There\" :tag \"O\"}
    ...
  ) "
  [^java.io.Reader reader]
  (let [cs (conll-seq reader)]
    (map (fn [ca] (map corelabel->map ca)) cs)))

(comment
  (let [filename "/Users/djunk/projects/L645/project/arktweetnlp/ark-tweet-nlp/data/twpos-data-v0.3/oct27.conll"]
    (with-open [rdr (io/reader (io/file filename))]
      (let [ms   (conll-map-seq rdr)
            samp (first ms)]
        ;; Return both results; `tags` was undefined, `pos` is the tag accessor.
        [(words samp)
         (pos samp)])))
  )
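
The comment block above depends on StanfordNLP and a local data file. To try the same data shape without either, here is a minimal pure-Clojure sketch. It assumes a two-column, whitespace-separated `token tag` layout with blank lines between sentences, which is what the conll-map-seq docstring suggests. `parse-conll`, `sample`, and `sentences` are hypothetical names for illustration only; they are not part of the gist or of StanfordNLP, and this is not a description of how ColumnDocumentReaderAndWriter parses its input.

```clojure
(ns conll-sketch
  (:require [clojure.string :as str])
  (:import [java.io BufferedReader StringReader]))

(defn parse-conll
  "Hypothetical stand-in for conll-map-seq. Groups `token tag` lines
  into sentences, splitting on blank lines. Uses mapv so everything is
  read before the caller closes the reader."
  [^BufferedReader rdr]
  (->> (line-seq rdr)
       (partition-by str/blank?)
       ;; Drop the groups that consist of blank separator lines.
       (remove #(str/blank? (first %)))
       (mapv (fn [lines]
               (mapv (fn [line]
                       (let [[token tag] (str/split (str/trim line) #"\s+")]
                         {:token token :tag tag}))
                     lines)))))

;; Assumed sample input: two sentences separated by a blank line.
(def sample "Hi\tO\nThere\tO\n\nBye\tO\n")

(def sentences
  (with-open [rdr (BufferedReader. (StringReader. sample))]
    (parse-conll rdr)))

(map :token (first sentences)) ;; => ("Hi" "There")
(map :tag (first sentences))   ;; => ("O" "O")
```

Reading eagerly inside with-open is a deliberate choice here: a lazy sequence that escapes with-open can fail once the reader is closed, which is also worth keeping in mind when consuming conll-map-seq.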