fasiha/README.md

## README.md

      
Use Kuromoji and UniDic from JVM Clojure

Install

Follow the Leiningen (lein) installation instructions at http://leiningen.org/#install.

Create

$ lein new app clojure-kuromoji
$ cd clojure-kuromoji

Replace

Replace project.clj and src/clojure_kuromoji/core.clj with the two files in this gist.

Get dependencies

$ lein deps

This downloads the dependencies declared in project.clj, including Kuromoji and its UniDic dictionary package.

Run

$ lein run

Here’s a snippet of the resulting EDN printout:
({:all-features
  ["接頭辞"
   "*"
   "*"
   "*"
   "*"
   "*"
   "オ"
   "御"
   "お"
   "オ"
   "お"
   "オ"
   "和"
   "*"
   "*"
   "促添"
   "基本形"],
  :conjugation [:uninflected],
  :conjugation-type [],
  :final-sound-alternation-form "基本形",
  :final-sound-alternation-type "促添",
  :initial-sound-alternation-form "*",
  :initial-sound-alternation-type "*",
  :known? true,
  :language-type "和",
  :lemma "御",
  :lemma-pronunciation "オ",
  :lemma-reading "オ",
  :literal "お",
  :literal-pronunciation "オ",
  :part-of-speech [:prefix],
  :position 0,
  :user? false,
  :written-base-form "お",
  :written-form "お"}
; ... snip ✂ ...
)
This is Kuromoji’s tokenization of 「お寿司が食べたい。」 using UniDic.
(For full output, see full-output.edn below.)
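
Because the printout is EDN, it can be read back into Clojure data with `clojure.edn`. The following is a minimal sketch under a few assumptions: `sample-edn` is a hand-trimmed excerpt of the first token above (only four of its keys), and `literal->lemma` is a hypothetical helper name that does not appear in the gist.

```clojure
(require '[clojure.edn :as edn])

;; Hand-trimmed excerpt of the printout above: only a few keys of the
;; first token are kept so the example stays short.
(def sample-edn
  "({:lemma \"御\", :literal \"お\", :part-of-speech [:prefix], :position 0})")

;; edn/read-string parses the text into ordinary Clojure data:
;; here, a list holding one map with keyword keys.
(def tokens (edn/read-string sample-edn))

;; Hypothetical helper: pair each token's surface form with its lemma.
(defn literal->lemma [tokens]
  (mapv (juxt :literal :lemma) tokens))

(println (literal->lemma tokens))
```

If the complete printout were saved to a file, the same approach should work with `(edn/read-string (slurp "full-output.edn"))`, assuming the file sits in the working directory.
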
Abbreviated tokenization

Here’s a subset of the above tokenized data for easier digestion. Again, it’s Kuromoji/UniDic’s tokenization of 「お寿司が食べたい。」.


| literal | lemma | part of speech | conjugation | conjugation type |
|---|---|---|---|---|
| お | 御 | [:prefix] | [:uninflected] | [] |
| 寿司 | 寿司 | [:noun :common :general] | [:uninflected] | [] |
| が | が | [:particle :case] | [:uninflected] | [] |
| 食べ | 食べる | [:verb :general] | [:continuative :general] | [:shimoichidan-verb-e-row :ba-column] |
| たい | たい | [:auxiliary-verb] | [:conclusive :general] | [:auxiliary :tai] |
| 。 | 。 | [:supplementary-symbol :period] | [:uninflected] | [] |

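
An abbreviated view like the table above can be derived from the full token maps with `select-keys`. This is a sketch rather than the code that produced the table: `summary-keys`, `summarize`, and `sample-tokens` are hypothetical names, and the two sample maps are copied by hand from the full output with most keys omitted.

```clojure
(require '[clojure.pprint :refer [print-table]])

;; Columns shown in the abbreviated table, in display order.
(def summary-keys
  [:literal :lemma :part-of-speech :conjugation :conjugation-type])

;; Hypothetical helper: keep only the summary columns of each token map.
(defn summarize [token-maps]
  (mapv #(select-keys % summary-keys) token-maps))

;; Two tokens copied by hand from the full output. :position is included
;; only to show that select-keys drops keys outside summary-keys.
(def sample-tokens
  [{:literal "食べ" :lemma "食べる" :position 4
    :part-of-speech [:verb :general]
    :conjugation [:continuative :general]
    :conjugation-type [:shimoichidan-verb-e-row :ba-column]}
   {:literal "たい" :lemma "たい" :position 6
    :part-of-speech [:auxiliary-verb]
    :conjugation [:conclusive :general]
    :conjugation-type [:auxiliary :tai]}])

;; print-table takes the column keys and a sequence of maps.
(print-table summary-keys (summarize sample-tokens))
```

In core.clj, the same helper could be applied to `all-results` inside `-main` in place of `sample-tokens`.
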


P.S. IPADIC

In principle you can replace UniDic with IPADIC, but the IPADIC Token class exposes a different set of methods from UniDic’s, so adjust core.clj accordingly.
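
As a rough illustration of that adjustment, here is a sketch of an IPADIC counterpart to `token-to-map`. It assumes the getters of `com.atilika.kuromoji.ipadic.Token` in Kuromoji 0.9.0 (`getBaseForm`, `getReading`, `getPronunciation`, and so on); the function name `ipadic-token-to-map` and the map keys are hypothetical choices, and the sketch has not been run against the gist's project.

```clojure
;; Sketch of an IPADIC counterpart to token-to-map. No imports or type
;; hints are used, so the getters are resolved by reflection at call time
;; and this form loads even without Kuromoji on the classpath.
(defn ipadic-token-to-map [token]
  {:literal          (.getSurface token)
   :base-form        (.getBaseForm token)
   :reading          (.getReading token)
   :pronunciation    (.getPronunciation token)
   ;; IPADIC tags are left as raw strings here: the keywordize-* maps
   ;; in core.clj were written for UniDic's tag set.
   :part-of-speech   (vec (remove #{"*"}
                                  [(.getPartOfSpeechLevel1 token)
                                   (.getPartOfSpeechLevel2 token)
                                   (.getPartOfSpeechLevel3 token)
                                   (.getPartOfSpeechLevel4 token)]))
   :conjugation      (.getConjugationForm token)
   :conjugation-type (.getConjugationType token)
   :known?           (.isKnown token)
   :position         (.getPosition token)})

;; Intended usage inside the Leiningen project, where kuromoji-ipadic is
;; already listed in project.clj. Not evaluated when this file is loaded.
(comment
  (import '[com.atilika.kuromoji.ipadic Tokenizer])
  (map ipadic-token-to-map (.tokenize (Tokenizer.) "お寿司が食べたい。")))
```

Leaving the part-of-speech and conjugation values as strings keeps the sketch honest about what it knows; a keyword mapping for IPADIC's tags would need its own tables.
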

  
## core.clj
(ns clojure-kuromoji.core
  (:require [clojure.string :as string]
            [clojure.pprint :refer [pprint]])
  (:import [com.atilika.kuromoji.unidic Token Tokenizer])
  (:gen-class))

; see https://gist.github.com/masayu-a/e3eee0637c07d4019ec9
(def keywordize-pos
  {"代名詞"         :pronoun
   "副詞"            :adverb
   "助動詞"          :auxiliary-verb
   "助詞"            :particle
   "係助詞"          :binding
   "副助詞"          :adverbial
   "接続助詞"        :conjunctive
   "格助詞"          :case
   "準体助詞"        :nominal
   "終助詞"          :phrase-final
   "動詞"            :verb
   "一般"            :general
   "非自立可能"      :bound
   "名詞"            :noun
   "助動詞語幹"      :auxiliary
   "固有名詞"        :proper
   "人名"            :name
   "名"              :firstname
   "姓"              :surname
   "地名"            :place
   "国"              :country
   "数詞"            :numeral
   "普通名詞"        :common
   "サ変可能"        :verbal-suru
   "サ変形状詞可能"  :verbal-adjectival
   "副詞可能"        :adverbial-suffix
   "助数詞可能"      :counter
   "形状詞可能"      :adjectival
   "形容詞"          :adjective-i
   "形状詞"          :adjectival-noun
   "タリ"            :tari
   "感動詞"          :interjection
   "フィラー"        :filler
   "接尾辞"          :suffix
   "動詞的"          :verbal
   "名詞的"          :nominal-suffix
   "助数詞"          :counter-suffix
   "形容詞的"        :adjective-i-suffix
   "形状詞的"        :adjectival-noun-suffix
   "接続詞"          :conjunction
   "接頭辞"          :prefix
   "空白"            :whitespace
   "補助記号"        :supplementary-symbol
   "ＡＡ"            :ascii-art
   "顔文字"          :emoticon
   "句点"            :period
   "括弧閉"          :bracket-close
   "括弧開"          :bracket-open
   "読点"            :comma
   "記号"            :symbol
   "文字"            :character
   "連体詞"          :adnominal
   "未知語"          :unknown-words
   "カタカナ文"      :katakana
   "漢文"            :chinese-writing
   "言いよどみ"      :hesitation
   "web誤脱"         :errors-omissions
   "方言"            :dialect
   "ローマ字文"      :latin-alphabet
   "新規未知語"      :new-unknown-words
   })

; see https://gist.github.com/masayu-a/3e11168f9330e2d83a68
(def keywordize-inflection
  {
   "ク語法"      :ku-wording
   "仮定形"      :conditional
   "一般"        :general
   "融合"        :integrated
   "命令形"      :imperative
   "已然形"      :realis
   "補助"        :auxiliary-inflection
   "意志推量形"  :volitional-tentative
   "未然形"      :irrealis
   "サ"          :sa
   "セ"          :se
   "撥音便"      :euphonic-change-n
   "終止形"      :conclusive
   "ウ音便"      :euphonic-change-u
   "促音便"      :euphonic-change-t
   "語幹"        :word-stem
   "連体形"      :attributive
   "イ音便"      :euphonic-change-i
   "省略"        :abbreviation
   "連用形"      :continuative
   "ト"          :change-to
   "ニ"          :change-ni
   "長音"        :long-sound
   "*"           :uninflected
   })

; see https://gist.github.com/masayu-a/b3ce862336e47736e84f
(def keywordize-inflection-type
  {"ユク"          :yuku
   "ダ行"          :da-column
   "ザ行変格"      :zahen-verb-irregular
   "ダ"            :da
   "タイ"          :tai
   "文語ラ行変格"  :classical-ra-column-change
   "ワ行"          :wa-column
   "コス"          :kosu
   "キ"            :ki
   "文語下二段"    :classical-shimonidan-verb-e-u-row
   "ス"            :su
   "ハ行"          :ha-column
   "上一段"        :kamiichidan-verb-i-row
   "イク"          :iku
   "マ行"          :ma-column
   "助動詞"        :auxiliary
   "シク"          :shiku
   "ナ行"          :na-column
   "ガ行"          :ga-column
   "ム"            :mu
   "ア行"          :a-column
   "ザンス"        :zansu
   "文語形容詞"    :classical-adjective
   "タ"            :ta
   "伝聞"          :reported-speech
   "ナイ"          :nai
   "ヘン"          :hen
   "文語助動詞"    :classical-auxiliary
   "ジ"            :ji
   "ワア行"        :wa-a-column
   "文語ナ行変格"  :classical-na-column-change
   "カ行変格"      :kahen-verb-irregular
   "ラシ"          :rashi
   "マイ"          :mai
   "タリ"          :tari
   "呉レル"        :kureru
   "形容詞"        :adjective
   "ゲナ"          :gena
   "一般+う"       :general-u
   "ザマス"        :zamasu
   "ゴトシ"        :gotoshi
   "ヌ"            :nu
   "文語上二段"    :classical-kaminidan-verb-u-i-row
   "ク"            :ku
   "サ行変格"      :sahen-verb-irregular
   "ラ行"          :ra-column
   "下一段"        :shimoichidan-verb-e-row
   "完了"          :final
   "ラシイ"        :rashii
   "文語四段"      :classical-yondan-verb
   "ドス"          :dosu
   "ザ行"          :za-column
   "ツ"            :tsu
   "ヤス"          :yasu
   "バ行"          :ba-column
   "断定"          :assertive
   "ナンダ"        :nanda
   "ケリ"          :keri
   "文語サ行変格"  :classical-sa-column-change
   "タ行"          :ta-column
   "ケム"          :kemu
   "カ行"          :ka-column
   "ゲス"          :gesu
   "ヤ行"          :ya-column
   "マス"          :masu
   "レル"          :reru
   "サ行"          :sa-column
   "文語下一段"    :classical-shimoichidan-verb-e-row
   "ベシ"          :beshi
   "アル"          :aru
   "ヤ"            :ya
   "五段"          :godan-verb
   "一般"          :general
   "デス"          :desu
   "リ"            :ri
   "ナリ"          :nari
   "文語上一段"    :classical-kamiichidan-verb-i-row
   "無変化型"      :uninflected-form
   "ズ"            :zu
   "ジャ"          :ja
   "文語カ行変格"  :classical-ka-column-change
   "イウ"          :iu
   })

(defn split-dashes [s] (string/split s #"-"))

; Wrapper for all methods in [1] and [2]
; [1] UniDic-specific `Token` methods:
;     https://github.com/atilika/kuromoji/blob/master/kuromoji-unidic/src/main/java/com/atilika/kuromoji/unidic/Token.java
; [2] Parent `TokenBase` methods:
;     https://github.com/atilika/kuromoji/blob/master/kuromoji-core/src/main/java/com/atilika/kuromoji/TokenBase.java
(defn token-to-map [token]
  {:lemma (.getLemma token)
   :lemma-reading (.getLemmaReadingForm token)
   :lemma-pronunciation (.getPronunciationBaseForm token)
   :literal-pronunciation (.getPronunciation token)
   :part-of-speech (mapv #(or (get keywordize-pos %)
                              :unknown-pos)
                         (filter #(not (= % "*"))
                                 [(.getPartOfSpeechLevel1 token)
                                  (.getPartOfSpeechLevel2 token)
                                  (.getPartOfSpeechLevel3 token)
                                  (.getPartOfSpeechLevel4 token)]))
   :conjugation (mapv #(or (get keywordize-inflection %)
                           :unknown-inflection)
                      (split-dashes (.getConjugationForm token)))
   :conjugation-type (mapv #(or (get keywordize-inflection-type %)
                                :unknown-inflection-type)
                           (filter #(not (= % "*"))
                                   (split-dashes (.getConjugationType token))))
   :written-form (.getWrittenForm token)
   :written-base-form (.getWrittenBaseForm token)
   :language-type (.getLanguageType token)
   :initial-sound-alternation-type (.getInitialSoundAlterationType token)
   :initial-sound-alternation-form (.getInitialSoundAlterationForm token)
   :final-sound-alternation-type (.getFinalSoundAlterationType token)
   :final-sound-alternation-form (.getFinalSoundAlterationForm token)
   ; from TokenBase.java
   :literal (.getSurface token)
   :known? (.isKnown token)
   :user? (.isUser token)
   :position (.getPosition token)
   :all-features (string/split (.getAllFeatures token) #",")
   })

(def s "お寿司が食べたい。")

(defn -main
  [& args]
  (let [t (Tokenizer.)
        ; all-results is a list of maps
        all-results (map token-to-map (.tokenize t s))]
    ; fancy pretty-printing. Use sorted-map for alphabetized keys.
    (pprint (map #(into (sorted-map) %) all-results))))

## full-output.edn
({:all-features
  ["接頭辞"
   "*"
   "*"
   "*"
   "*"
   "*"
   "オ"
   "御"
   "お"
   "オ"
   "お"
   "オ"
   "和"
   "*"
   "*"
   "促添"
   "基本形"],
  :conjugation [:uninflected],
  :conjugation-type [],
  :final-sound-alternation-form "基本形",
  :final-sound-alternation-type "促添",
  :initial-sound-alternation-form "*",
  :initial-sound-alternation-type "*",
  :known? true,
  :language-type "和",
  :lemma "御",
  :lemma-pronunciation "オ",
  :lemma-reading "オ",
  :literal "お",
  :literal-pronunciation "オ",
  :part-of-speech [:prefix],
  :position 0,
  :user? false,
  :written-base-form "お",
  :written-form "お"}
 {:all-features
  ["名詞"
   "普通名詞"
   "一般"
   "*"
   "*"
   "*"
   "スシ"
   "寿司"
   "寿司"
   "スシ"
   "寿司"
   "スシ"
   "和"
   "ス濁"
   "基本形"
   "*"
   "*"],
  :conjugation [:uninflected],
  :conjugation-type [],
  :final-sound-alternation-form "*",
  :final-sound-alternation-type "*",
  :initial-sound-alternation-form "基本形",
  :initial-sound-alternation-type "ス濁",
  :known? true,
  :language-type "和",
  :lemma "寿司",
  :lemma-pronunciation "スシ",
  :lemma-reading "スシ",
  :literal "寿司",
  :literal-pronunciation "スシ",
  :part-of-speech [:noun :common :general],
  :position 1,
  :user? false,
  :written-base-form "寿司",
  :written-form "寿司"}
 {:all-features
  ["助詞"
   "格助詞"
   "*"
   "*"
   "*"
   "*"
   "ガ"
   "が"
   "が"
   "ガ"
   "が"
   "ガ"
   "和"
   "*"
   "*"
   "*"
   "*"],
  :conjugation [:uninflected],
  :conjugation-type [],
  :final-sound-alternation-form "*",
  :final-sound-alternation-type "*",
  :initial-sound-alternation-form "*",
  :initial-sound-alternation-type "*",
  :known? true,
  :language-type "和",
  :lemma "が",
  :lemma-pronunciation "ガ",
  :lemma-reading "ガ",
  :literal "が",
  :literal-pronunciation "ガ",
  :part-of-speech [:particle :case],
  :position 3,
  :user? false,
  :written-base-form "が",
  :written-form "が"}
 {:all-features
  ["動詞"
   "一般"
   "*"
   "*"
   "下一段-バ行"
   "連用形-一般"
   "タベル"
   "食べる"
   "食べ"
   "タベ"
   "食べる"
   "タベル"
   "和"
   "*"
   "*"
   "*"
   "*"],
  :conjugation [:continuative :general],
  :conjugation-type [:shimoichidan-verb-e-row :ba-column],
  :final-sound-alternation-form "*",
  :final-sound-alternation-type "*",
  :initial-sound-alternation-form "*",
  :initial-sound-alternation-type "*",
  :known? true,
  :language-type "和",
  :lemma "食べる",
  :lemma-pronunciation "タベル",
  :lemma-reading "タベル",
  :literal "食べ",
  :literal-pronunciation "タベ",
  :part-of-speech [:verb :general],
  :position 4,
  :user? false,
  :written-base-form "食べる",
  :written-form "食べ"}
 {:all-features
  ["助動詞"
   "*"
   "*"
   "*"
   "助動詞-タイ"
   "終止形-一般"
   "タイ"
   "たい"
   "たい"
   "タイ"
   "たい"
   "タイ"
   "和"
   "*"
   "*"
   "*"
   "*"],
  :conjugation [:conclusive :general],
  :conjugation-type [:auxiliary :tai],
  :final-sound-alternation-form "*",
  :final-sound-alternation-type "*",
  :initial-sound-alternation-form "*",
  :initial-sound-alternation-type "*",
  :known? true,
  :language-type "和",
  :lemma "たい",
  :lemma-pronunciation "タイ",
  :lemma-reading "タイ",
  :literal "たい",
  :literal-pronunciation "タイ",
  :part-of-speech [:auxiliary-verb],
  :position 6,
  :user? false,
  :written-base-form "たい",
  :written-form "たい"}
 {:all-features
  ["補助記号"
   "句点"
   "*"
   "*"
   "*"
   "*"
   ""
   "。"
   "。"
   ""
   "。"
   ""
   "記号"
   "*"
   "*"
   "*"
   "*"],
  :conjugation [:uninflected],
  :conjugation-type [],
  :final-sound-alternation-form "*",
  :final-sound-alternation-type "*",
  :initial-sound-alternation-form "*",
  :initial-sound-alternation-type "*",
  :known? true,
  :language-type "記号",
  :lemma "。",
  :lemma-pronunciation "",
  :lemma-reading "",
  :literal "。",
  :literal-pronunciation "",
  :part-of-speech [:supplementary-symbol :period],
  :position 8,
  :user? false,
  :written-base-form "。",
  :written-form "。"})

## project.clj
(defproject clojure-kuromoji "0.1.0-SNAPSHOT"
  :description "Tokenize Japanese text with Kuromoji and UniDic from JVM Clojure"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [org.atilika.kuromoji/kuromoji "0.7.7"]
                 [com.atilika.kuromoji/kuromoji-ipadic "0.9.0"]
                 [com.atilika.kuromoji/kuromoji-unidic "0.9.0"]]
  :repositories [["Atilika Open Source repository"
                  "http://www.atilika.org/nexus/content/repositories/atilika"]]
  :main ^:skip-aot clojure-kuromoji.core
  :target-path "target/%s"
  :profiles {:uberjar {:aot :all}})