@bebop-001
Created June 3, 2020 04:36
Kotlin unicode-block regexes for extracting various Japanese char types to a string list.
// see: http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
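val fullWidthHiraganaRegex = "[ぁ-ゟ]".toRegex()
val fullWidthKatakanaRegex = "[゠-ヿ]".toRegex()
// CJK Extension A, CJK Unified Ideographs, CJK Compatibility Ideographs
val kanjiRegex = "[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]".toRegex()
val radicalsRegex = "[⺀-⿕]".toRegex()
// half-width katakana block: U+FF66..U+FF9F
val halfWidthKatakanaRegex = "[\uFF66-\uFF9F]".toRegex()
// full-width forms of the printable ASCII range (letters, digits, punctuation): U+FF01..U+FF5E
val fullWidthAlphaNumRegex = "[\uFF01-\uFF5E]".toRegex()
val japSymbolsRegex = "[、-〿]".toRegex()
val miscSymbolsRegex = "[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]".toRegex()
val asciiCharsRegex = "[ -~]".toRegex()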
// Use the receiver regex to extract every match in textIn to a List<String>.
// Each regex above matches a single character, so each list element is one character.
fun Regex.extractToList(textIn: String): List<String> =
    findAll(textIn).map { it.value }.toList()
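
Because every regex above is a single character class, extractToList yields one list element per character. If whole runs of the same script are wanted instead, one possible approach is to wrap the pattern in a `+` quantifier. The following is only a sketch: `runs` is a hypothetical helper that is not part of the gist, and the hiragana regex is redefined with explicit code points so the block stands alone.

```kotlin
val fullWidthHiraganaRegex = "[\u3041-\u309F]".toRegex()

fun Regex.extractToList(textIn: String): List<String> =
    findAll(textIn).map { it.value }.toList()

// Hypothetical helper (not in the gist): same character class,
// but matching one or more consecutive characters.
fun Regex.runs(): Regex = Regex("(?:${this.pattern})+")

fun main() {
    val text = "皆たち、こんにちは。"
    println(fullWidthHiraganaRegex.extractToList(text))        // [た, ち, こ, ん, に, ち, は]
    println(fullWidthHiraganaRegex.runs().extractToList(text)) // [たち, こんにちは]
}
```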
@bebop-001
Author

Usage, assuming you put the file in the same directory as the calling class:

import kanjiRegex
import extractToList
...
val kanjiList = kanjiRegex.extractToList("String with kanji: 皆たち、こんにちは。")

This returns a single-element list containing "皆".
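
The same call pattern can be applied with several regexes at once. As a rough sketch, the block below groups the matches from the sample string by script. `splitByScript` and the shortened names `hiragana`, `katakana`, and `kanji` are hypothetical and exist only for this illustration; the regexes are redefined with explicit code points so the example is self-contained.

```kotlin
val hiragana = "[\u3041-\u309F]".toRegex()
val katakana = "[\u30A0-\u30FF]".toRegex()
val kanji = "[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]".toRegex()

fun Regex.extractToList(textIn: String): List<String> =
    findAll(textIn).map { it.value }.toList()

// Hypothetical helper (not in the gist): run each regex over the text
// and key the resulting lists by script name.
fun splitByScript(text: String): Map<String, List<String>> = mapOf(
    "hiragana" to hiragana.extractToList(text),
    "katakana" to katakana.extractToList(text),
    "kanji" to kanji.extractToList(text)
)

fun main() {
    val result = splitByScript("皆たち、こんにちは。")
    println(result["kanji"])    // [皆]
    println(result["hiragana"]) // [た, ち, こ, ん, に, ち, は]
    println(result["katakana"]) // []
}
```

The punctuation marks 、 and 。 fall outside all three ranges, so they do not appear in any of the lists.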