@terrancesnyder
Created November 7, 2011 14:05
Regex for Japanese
Regex for matching common & uncommon Japanese Kanji in the CJK Unified Ideographs range (4e00 – 9faf) ~ The Big Kahuna!
([一-龯])
Regex for matching Hiragana or Katakana
([ぁ-んァ-ン])
Regex for matching Non-Hiragana or Non-Katakana
([^ぁ-んァ-ン])
Regex for matching Hiragana or Katakana or basic punctuation (、。’)
([ぁ-んァ-ン、。’])
Regex for matching Hiragana or Katakana and random other characters
([ぁ-んァ-ン!:/])
Regex for matching Hiragana
([ぁ-ん])
Regex for matching full-width Katakana (zenkaku 全角)
([ァ-ン])
Regex for matching half-width Katakana (hankaku 半角)
([ｧ-ﾝﾞﾟ])
Regex for matching full-width Numbers (zenkaku 全角)
([０-９])
Regex for matching full-width Letters (zenkaku 全角)
([Ａ-Ｚａ-ｚ])
Regex for matching Hiragana codespace characters (includes non phonetic characters)
([ぁ-ゞ])
Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters)
([ァ-ヶ])
Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana)
([ｦ-ﾟ])
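As a rough illustration of how the character classes above can be used, here is a minimal Python sketch. The `PATTERNS` table and the `classify` helper are hypothetical names introduced for this example, and the ranges are written as code-point escapes (assuming U+FF66 to U+FF9F for the half-width Katakana block) so that they are not altered by Unicode normalization when the text is copied around.

```python
import re

# Ranges are written as escapes so they survive Unicode normalization.
PATTERNS = {
    "kanji": re.compile(r"[\u4e00-\u9faf]"),               # 一-龯
    "hiragana": re.compile(r"[\u3041-\u309e]"),            # ぁ-ゞ
    "katakana": re.compile(r"[\u30a1-\u30f6]"),            # ァ-ヶ
    "halfwidth_katakana": re.compile(r"[\uff66-\uff9f]"),  # U+FF66 to U+FF9F
}


def classify(ch: str) -> str:
    """Return the first matching label for a single character, else 'other'."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(ch):
            return label
    return "other"


# Classify one character from each group, plus an ASCII letter.
print([classify(c) for c in "漢あア\uff71A"])
```

Checking one character at a time keeps the sketch simple; for whole strings, a quantified class such as `[\u3041-\u309e]+` with `fullmatch` would likely be more convenient.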
Regex for matching Japanese Post Codes
/^\d{3}-\d{4}$/
/^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/
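A small Python sketch of the post code check, assuming the strict 7-digit `NNN-NNNN` form is wanted by default and the shorter 3- and 5-digit forms only on request. The function name `is_postcode` and the `allow_short` parameter are made up for this example. `re.ASCII` is used so that `\d` only accepts `0-9`.

```python
import re

# Strict form: three digits, a hyphen, four digits.
STRICT = re.compile(r"\d{3}-\d{4}", re.ASCII)
# Looser form that also accepts NNN-NN and NNN, as in the second regex above.
LOOSE = re.compile(r"\d{3}-\d{4}|\d{3}-\d{2}|\d{3}", re.ASCII)


def is_postcode(text: str, allow_short: bool = False) -> bool:
    """Return True if the whole string is a post code in the chosen form."""
    pattern = LOOSE if allow_short else STRICT
    return pattern.fullmatch(text) is not None


print(is_postcode("100-0001"))
print(is_postcode("100", allow_short=True))
```

`fullmatch` plays the role of the `^...$` anchors, so no alternative can match only a prefix of the input.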
Regex for matching Japanese mobile phone numbers (keitai bangou)
/^\d{3}-\d{4}-\d{4}$|^\d{11}$/
/^0\d0-\d{4}-\d{4}$/
Regex for matching Japanese fixed line phone numbers
/^[0-9-]{6,9}$|^[0-9-]{12}$/
/^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/
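The phone patterns overlap: a hyphenated mobile number such as `090-1234-5678` also satisfies the fixed-line pattern. One possible way to deal with that, sketched in Python below, is to test the narrower mobile pattern first. The `phone_kind` helper is a hypothetical name for this example, and the patterns only check the shape of the number, not whether the number actually exists.

```python
import re
from typing import Optional

# Mobile: 0N0-NNNN-NNNN (for example 070, 080, 090 prefixes).
MOBILE = re.compile(r"0\d0-\d{4}-\d{4}", re.ASCII)
# Fixed line: local number alone, or area code plus local number.
FIXED = re.compile(r"\d{1,4}-\d{4}|\d{2,5}-\d{1,4}-\d{4}", re.ASCII)


def phone_kind(text: str) -> Optional[str]:
    """Return 'mobile', 'fixed', or None when neither shape matches."""
    if MOBILE.fullmatch(text):
        return "mobile"  # checked first because FIXED would match it too
    if FIXED.fullmatch(text):
        return "fixed"
    return None


print(phone_kind("090-1234-5678"))
print(phone_kind("03-1234-5678"))
```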
@epistularum

epistularum commented Sep 26, 2022

This doesn't cover all kanjis. Simple example: 𧓈

@Araxeus

Araxeus commented Mar 18, 2023

There is a much easier way to do this:

/\p{Script=Han}|\p{Script=Katakana}|\p{Script=Hiragana}/u

See https://www.regular-expressions.info/unicode.html, section "Unicode Scripts".
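Python's built-in `re` module does not support `\p{Script=...}`, so the pattern above cannot be used there as written. As a rough stand-in (a different technique, not an exact equivalent of the Unicode script property), the sketch below looks at the official character name from `unicodedata`. The helper name `is_japanese_char` and the prefix list are assumptions made for this example.

```python
import unicodedata

# Character-name prefixes used as an approximation of the Han, Hiragana
# and Katakana scripts. This is not identical to the Script property:
# for example, it ignores CJK radicals and compatibility ideographs.
_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",
    "HIRAGANA",
    "KATAKANA",
    "HALFWIDTH KATAKANA",
)


def is_japanese_char(ch: str) -> bool:
    """Return True if the character's Unicode name starts with a known prefix."""
    name = unicodedata.name(ch, "")  # "" for unnamed code points
    return name.startswith(_PREFIXES)


print([is_japanese_char(c) for c in "漢あアA"])
```

Because the name lookup covers every CJK Unified Ideographs extension that the interpreter's Unicode database knows about, this also accepts rare characters outside the 4e00 – 9faf range.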

@Jaha96

Jaha96 commented Apr 21, 2023

Japanese imperial date regex:
Example: 令和5年2月 24 日

(令和|平成|昭和|大正|明治)(\d+)年\s?(\d{1,2})\s?月\s?(\d{1,2})\s?日

@golddranks

@Jaha96 That doesn't catch all: the first year is commonly marked as 元年 instead of 1年. 令和元年 = year 2019, for example. This case was widely disregarded in many libraries, but in actual life, it was very common to see it written that way.
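One possible way to accept 元年 as well, sketched in Python: the era name is matched as an alternation group, the year group accepts either 元 or digits, and the result is converted to a Gregorian year using the first year of each era. The names `ERA_START`, `DATE` and `parse_imperial_date` are hypothetical, and the sketch does not check that the date actually falls inside the era.

```python
import re

# First Gregorian year of each era.
ERA_START = {"明治": 1868, "大正": 1912, "昭和": 1926, "平成": 1989, "令和": 2019}

DATE = re.compile(
    r"(令和|平成|昭和|大正|明治)"  # era name
    r"(元|\d+)年\s?"               # era year, 元 meaning year 1
    r"(\d{1,2})\s?月\s?"           # month
    r"(\d{1,2})\s?日"              # day
)


def parse_imperial_date(text):
    """Return (gregorian_year, month, day), or None if no date is found."""
    m = DATE.search(text)
    if m is None:
        return None
    era, year, month, day = m.groups()
    era_year = 1 if year == "元" else int(year)
    return ERA_START[era] + era_year - 1, int(month), int(day)


print(parse_imperial_date("令和5年2月 24 日"))
print(parse_imperial_date("令和元年5月1日"))
```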

@garamoi-choi

I'm working on Android and \d matches ０ (U+FF10), too.
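The same thing happens in Python, where `\d` on `str` patterns matches any Unicode decimal digit unless the `re.ASCII` flag is set. A short sketch of the difference (the variable names are only for illustration):

```python
import re

FULLWIDTH_ZERO = "\uff10"  # full-width digit zero

unicode_digit = re.compile(r"\d")           # Unicode-aware by default
ascii_digit = re.compile(r"\d", re.ASCII)   # restricted to 0-9
explicit_digit = re.compile(r"[0-9]")       # explicit ASCII range

print(bool(unicode_digit.fullmatch(FULLWIDTH_ZERO)))
print(bool(ascii_digit.fullmatch(FULLWIDTH_ZERO)))
print(bool(explicit_digit.fullmatch(FULLWIDTH_ZERO)))
```

If full-width digits should be rejected in post codes or phone numbers, either the flag or the explicit `[0-9]` class avoids the surprise.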

@brainexcerpts

This doesn't cover all kanjis. Simple example: 𧓈

To be fair, those kanji are extremely rare and not in practical use (they would not show up in dictionaries or Rikaichan-like extensions), and 99.99% of Japanese people would not know about them:
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B

Now you can match them with: [𠀀-𪛟]
and to match everything you would simply do: [𠀀-𪛟]|[一-龯]
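In Python the two ranges can be folded into a single character class. The sketch below writes them as escapes, assuming Extension B spans U+20000 to U+2A6DF; the name `KANJI_WITH_EXT_B` is made up for this example.

```python
import re

# Basic range (一-龯) plus CJK Unified Ideographs Extension B.
KANJI_WITH_EXT_B = re.compile(r"[\u4e00-\u9faf\U00020000-\U0002a6df]")

rare = "\U000274c8"  # the rare character from the earlier comment

print(bool(KANJI_WITH_EXT_B.fullmatch(rare)))
print(bool(KANJI_WITH_EXT_B.fullmatch("一")))
print(bool(KANJI_WITH_EXT_B.fullmatch("あ")))
```

Python strings are sequences of code points, so the supplementary-plane range works without any extra flag. In engines that operate on UTF-16 code units, a Unicode mode is typically needed for such a range to behave as intended.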

@techjp

techjp commented Apr 11, 2024

@cb372, your list comes close to covering all the kana, but a few characters are still missing. You got 「ゞ」 but missed 「ゝ」 and 「ゟ」, and a few others. I believe this would cover all Hiragana and Katakana separately:

Hiragana = [ぁ-ゖ゛-ゟー]
Katakana = [゠-ヿ]

Combined Hiragana & Katakana would be:

Hiragana+Katakana = [ぁ-ゖ゛-ゟ゠-ヿ]

I used the above hiragana+katakana regex to validate the kana portions of the downloadable version of JMDICT and can confirm that apart from a few errors in the JMDICT data, the kana validation works.
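A validation of that kind could look roughly like the Python sketch below. It is not the script used for the JMDICT check, only an illustration of the three classes above written as code-point escapes; the helper name `is_kana` is hypothetical.

```python
import re

# [ぁ-ゖ゛-ゟー]: hiragana letters, marks, and the prolonged sound mark.
HIRAGANA = re.compile(r"[\u3041-\u3096\u309b-\u309f\u30fc]+")
# [゠-ヿ]: the whole Katakana block.
KATAKANA = re.compile(r"[\u30a0-\u30ff]+")
# Combined class for strings that may mix both.
KANA = re.compile(r"[\u3041-\u3096\u309b-\u309f\u30a0-\u30ff]+")


def is_kana(text: str) -> bool:
    """Return True if the string is non-empty and consists only of kana."""
    return KANA.fullmatch(text) is not None


for word in ["ひらがな", "カタカナ", "らーめん", "漢字"]:
    print(word, is_kana(word))
```

The gap between U+3096 and U+309B is left out of the class on purpose: it holds two unassigned code points and the two combining voiced sound marks.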
