Created
March 15, 2024 20:19
If I want to match spaces between Chinese characters and English characters (in either order), as well as spaces between Chinese characters themselves, what regex should I use? #regex
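import regex  # third-party module (pip install regex); the built-in re module rejects \p{...}

def remove_spaces(text):
    # Remove whitespace between Han-Han, Han-Latin, and Latin-Han neighbors.
    pattern = regex.compile(r'(?<=\p{Script=Han})\s+(?=\p{Script=Han}|\p{Script=Latin})|(?<=\p{Script=Latin})\s+(?=\p{Script=Han})')
    return pattern.sub('', text)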
Here's the breakdown:

- `(?<=\p{Script=Han})` is a positive lookbehind assertion that ensures the match is preceded by a Chinese character (Han script).
- `\s+` matches one or more whitespace characters (spaces, tabs, newlines, etc.).
- `(?=\p{Script=Han}|\p{Script=Latin})` is a positive lookahead assertion that ensures the match is followed by either a Chinese character (Han script) or an English or other Latin-script character.
- `|` is the alternation operator, which allows matching either the expression before or after it.
- `(?<=\p{Script=Latin})` is a positive lookbehind assertion that ensures the match is preceded by an English or other Latin-script character.
- `(?=\p{Script=Han})` is a positive lookahead assertion that ensures the match is followed by a Chinese character (Han script).

Because lookbehind and lookahead assertions are zero-width, only the whitespace itself is matched. This regex will match spaces in the following cases:

- between two Chinese characters;
- between a Chinese character and a following Latin-script character;
- between a Latin-script character and a following Chinese character.
For example:

- In `"你好 world! 这是 一个 example."`, the regex will match the space between `"你好"` and `"world!"`, the space between `"这是"` and `"一个"`, and the space between `"一个"` and `"example."`. The space after `"world!"` is not matched, because `!` is neither a Han nor a Latin character.
- In `"This is an 例子 of mixed 语言."`, the regex will match the space between `"an"` and `"例子"`, the space between `"例子"` and `"of"`, and the space between `"mixed"` and `"语言."`.

The `\p{Script=Han}` and `\p{Script=Latin}` syntax for matching characters based on their Unicode script properties is not supported in all regex flavors. It is supported in Java, in JavaScript (with the `u` flag), and in Python through the third-party `regex` module. It is not supported by Python's built-in `re` module or by .NET, which offers Unicode general categories and named blocks but not scripts. Check the documentation of the language or library you are using to ensure compatibility.

If your regex flavor doesn't support the `\p{Script=...}` syntax, you can replace `\p{Script=Han}` with `[\u4e00-\u9fff]` to match Chinese characters, and `\p{Script=Latin}` with `[a-zA-Z]` to match English letters. These ranges are narrower than the script properties: `[\u4e00-\u9fff]` covers only the CJK Unified Ideographs block, and `[a-zA-Z]` excludes accented Latin letters.
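
As a rough sketch of that fallback, the same pattern can be written for Python's built-in `re` module by swapping in the explicit ranges. The names `HAN`, `LATIN`, `PATTERN`, and `remove_spaces_fallback` are made up for this illustration and are not part of the gist.

```python
import re

# Assumption: these ranges only approximate the script properties.
# [\u4e00-\u9fff] is the CJK Unified Ideographs block; [a-zA-Z] is ASCII letters.
HAN = r'[\u4e00-\u9fff]'
LATIN = r'[a-zA-Z]'

PATTERN = re.compile(
    rf'(?<={HAN})\s+(?={HAN}|{LATIN})'  # whitespace after Han, before Han or Latin
    rf'|(?<={LATIN})\s+(?={HAN})'       # whitespace after Latin, before Han
)

def remove_spaces_fallback(text):
    """Hypothetical helper: delete whitespace at Han-Han, Han-Latin, and Latin-Han boundaries."""
    return PATTERN.sub('', text)

print(remove_spaces_fallback("你好 world! 这是 一个 example."))
# → 你好world! 这是一个example.
print(remove_spaces_fallback("This is an 例子 of mixed 语言."))
# → This is an例子of mixed语言.
```

Spaces between two English words are left alone, since neither alternative allows a Latin character on both sides.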