@singularitti
Created March 15, 2024 20:19
If I want to match spaces between Chinese characters and English characters (in either order), as well as spaces between Chinese characters themselves, what regex should I use? #regex
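import regex  # third-party module (pip install regex); the built-in re does not support \p{Script=...}

def remove_spaces(text):
    # Remove whitespace between Han-Han, Han-Latin, and Latin-Han neighbors.
    pattern = regex.compile(r'(?<=\p{Script=Han})\s+(?=\p{Script=Han}|\p{Script=Latin})|(?<=\p{Script=Latin})\s+(?=\p{Script=Han})')
    return pattern.sub('', text)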
@singularitti

Here's the breakdown:

  • (?<=\p{Script=Han}) is a positive lookbehind assertion that ensures the match is preceded by a Chinese character (Han script).
  • \s+ matches one or more whitespace characters (spaces, tabs, newlines, etc.).
  • (?=\p{Script=Han}|\p{Script=Latin}) is a positive lookahead assertion that ensures the match is followed by either a Chinese character (Han script) or an English character or other Latin script character.
  • | is the alternation operator, which allows matching either the expression before or after it.
  • (?<=\p{Script=Latin}) is a positive lookbehind assertion that ensures the match is preceded by an English character or other Latin script character.
  • (?=\p{Script=Han}) is a positive lookahead assertion that ensures the match is followed by a Chinese character (Han script).

This regex will match spaces in the following cases:

  • Between Chinese characters and English (or other Latin script) characters, in either order.
  • Between Chinese characters themselves.

For example:

  • For the string "你好 world! 这是 一个 example.", the regex will match the space between "你好" and "world!", the space between "这是" and "一个", and the space between "一个" and "example.". The space between "world!" and "这是" is not matched, because "!" is punctuation and belongs to neither the Han nor the Latin script.
  • For the string "This is an 例子 of mixed 语言.", the regex will match the space between "an" and "例子", the space between "例子" and "of", and the space between "mixed" and "语言.".
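
The two examples above can be checked with a short sketch. It assumes the third-party regex module if it is installed and otherwise falls back to approximate character ranges with the built-in re module; the names HAN, LATIN, PATTERN, and matched_gaps are made up for illustration.

```python
try:
    import regex as engine  # third-party module; understands \p{Script=...}
    HAN, LATIN = r'\p{Script=Han}', r'\p{Script=Latin}'
except ImportError:
    import re as engine  # built-in module; use approximate ranges instead
    HAN, LATIN = r'[\u4e00-\u9fff]', r'[a-zA-Z]'

PATTERN = engine.compile(
    rf'(?<={HAN})\s+(?={HAN}|{LATIN})|(?<={LATIN})\s+(?={HAN})'
)

def matched_gaps(text):
    """Return (left_char, right_char) for each whitespace run the pattern matches."""
    return [(text[m.start() - 1], text[m.end()]) for m in PATTERN.finditer(text)]

print(matched_gaps("你好 world! 这是 一个 example."))
# [('好', 'w'), ('是', '一'), ('个', 'e')]
print(matched_gaps("This is an 例子 of mixed 语言."))
# [('n', '例'), ('子', 'o'), ('d', '语')]
```

Each tuple shows the characters on either side of a matched gap, which makes it easy to see that Latin-to-Latin gaps and gaps next to punctuation are left alone.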

Note that the \p{Script=Han} and \p{Script=Latin} syntax, which matches characters by their Unicode script property, is not supported in all regex flavors. Python's built-in re module does not support it; the third-party regex module does. Java, Perl, and JavaScript (with the u flag) also support script properties, although the exact spelling varies, while .NET supports only Unicode general categories and named blocks, not scripts. Check the documentation of the language or library you are using to confirm compatibility.

If your regex flavor doesn't support the \p{Script=...} syntax, you can approximate \p{Script=Han} with [\u4e00-\u9fff], which covers the basic CJK Unified Ideographs block but not the extension blocks, and \p{Script=Latin} with [a-zA-Z], which covers only unaccented ASCII letters.
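
A minimal sketch of that substitution, using only Python's built-in re module. The names HAN, LATIN, FALLBACK, and remove_spaces_fallback are hypothetical, and the ranges are approximations rather than exact equivalents of the script properties.

```python
import re

# Approximate stand-ins for \p{Script=Han} and \p{Script=Latin}.
HAN = r'[\u4e00-\u9fff]'
LATIN = r'[a-zA-Z]'

FALLBACK = re.compile(
    rf'(?<={HAN})\s+(?={HAN}|{LATIN})|(?<={LATIN})\s+(?={HAN})'
)

def remove_spaces_fallback(text):
    """Remove whitespace between Han-Han, Han-Latin, and Latin-Han neighbors."""
    return FALLBACK.sub('', text)

print(remove_spaces_fallback("你好 world! 这是 一个 example."))
# 你好world! 这是一个example.
print(remove_spaces_fallback("caf\u00e9 很好"))
# café 很好
```

The second call returns its input unchanged because "é" falls outside [a-zA-Z], which is one way the ranges behave differently from a true script match.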
