@hitsujiwool
Last active December 31, 2015 01:09
Developer Guide for Term Suite Japanese Component

Some additions and modifications were made to files outside the Japanese components, which are placed under the japanese directory. Although not tested comprehensively for all languages, these changes should have hardly any (negative) side effects on the other language components.

build.gradle

  • Added a remote Maven repository of my own, hosted on GitHub, for the Japanese component. As this is only for my own convenience, it may be better to clone the hosted artifacts and manage them in the term-suite repository.

  • Specified the directory for unit tests (under ttc-term-suite/tests/).

eu.project.ttc.resources.SimpleTermFrequency

  • Modified method allaw() so that it no longer ignores terms written in Japanese hiragana and katakana characters.
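
As a rough illustration of the kind of check involved, the sketch below tests whether a term is written entirely in kana. The class and method names (KanaCheck, isKana, isKanaOnly) are hypothetical and are not taken from SimpleTermFrequency; the actual logic inside allaw() may well differ.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Term Suite: detects hiragana and katakana.
final class KanaCheck {

    // True for hiragana and katakana code points. The prolonged sound mark
    // (U+30FC) is handled explicitly because it may not be classified as
    // KATAKANA by UnicodeScript.
    static boolean isKana(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA
                || codePoint == 0x30FC;
    }

    // True if the term is non-empty and every code point is kana.
    static boolean isKanaOnly(String term) {
        return !term.isEmpty() && term.codePoints().allMatch(KanaCheck::isKana);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"ひらがな", "カタカナ", "term"}) {
            System.out.println(term + " kana-only: " + isKanaOnly(term));
        }
    }
}
```

A term-frequency filter that must not discard kana terms could consult such a check before applying any rule originally designed for alphabetic languages.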

uima.sandbox.filter.engines

Introduction

  • This document is written for developers who want to change the behavior of Term Suite at the source-code level.
  • If you only want to use Term Suite, please see the Term Suite User Guide.

The outline of this document follows Term Suite's overall process, which can be expressed as a sequence of UIMA analysis engines, with particular focus on the local, language-dependent changes made for the Japanese components.

JapaneseSpotter

JapaneseTagger -> JapaneseNormaliser -> JapaneseTermSpotter -> SpotterTSVWriter -> JapaneseFilter -> Contextualizer -> Writer

JapaneseTagger

Instead of TreeTagger, we use the Japanese morphological analyzer Igo, which handles morpheme segmentation, lemmatization, and part-of-speech (POS) tagging. We developed a simple wrapper around Igo (uima-igo) to adapt it to the UIMA framework.

  • uima-igo (JapaneseMorphologicalAnalyzer)

To convert the output type of uima-igo (net.hitsujiwool.uima.igo.types.MechabMorpheme) to the type commonly used in Term Suite (eu.project.ttc.types.WordAnnotation), we use two general-purpose components.

The former maps the basic information (lemma, surface form, POS) obtained from the uima-igo output, and the latter "zips" the annotated features using the rules defined in japanese-pos-sub-category-zipping.xml. See Readme.md and the test cases in each repository for further details.

The three primitive analysis engines listed above make up the aggregate analysis engine JapaneseTagger, which is in turn one of the components of JapaneseSpotter.

JapaneseNormalizer

No additional processing is introduced at this stage, but some normalization steps that are unnecessary for Japanese (gender, mood, number, and so on) are omitted.

JapaneseFilter

In addition to the stopword-based uima-filter used by all language components, we adopt uima-regex-filter here, which filters out extracted terms by regular-expression matching. This is because Igo tends to tag sequences of non-Japanese characters as "noun", and these cannot be filtered out by a simple stopword list. Both single-word terms (SWT) and multiword terms (MWT) that consist only of symbols, parentheses, and brackets are filtered out in this process. Note that quite a few Japanese multiword terms include Latin letters as components (e.g. "iPS細胞 [iPS cell]", "C型肝炎 [Hepatitis C]"), so an annotated term is kept as long as it contains at least one Japanese character.
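
The keep/drop rule described above can be sketched with a single regular expression. This is only an illustration under stated assumptions: the class name JapaneseTermFilterSketch, the method keep, and the pattern itself are hypothetical, and they do not reproduce the actual API or configuration of uima-regex-filter. The sketch treats a character as Japanese if it belongs to the Hiragana, Katakana, or Han script.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the keep/drop rule; not the uima-regex-filter API.
final class JapaneseTermFilterSketch {

    // Matches one character from the Hiragana, Katakana, or Han scripts.
    // This pattern is an assumption made for illustration.
    private static final Pattern JAPANESE_CHAR =
            Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]");

    // Keep a term only if it contains at least one Japanese character.
    // Terms made up solely of symbols, parentheses, or brackets never match
    // and are therefore dropped.
    static boolean keep(String term) {
        return JAPANESE_CHAR.matcher(term).find();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"iPS細胞", "C型肝炎", "(*)", "[1]"}) {
            System.out.println(term + " -> " + (keep(term) ? "keep" : "drop"));
        }
    }
}
```

Under this reading, mixed terms such as "iPS細胞" survive because of their Japanese part, while a term with no Japanese character at all is dropped; whether purely Latin terms should also be dropped is a design decision that the real filter rules may handle differently.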

Others

The other components (such as JapaneseTermSpotter, SpotterTSVWriter, and Contextualizer) have no Japanese-specific changes.

Indexer

No major change for Japanese.

Aligner

No major change for Japanese.
