@nguyenvulebinh
Last active April 16, 2022 07:12
Remove unknown (unk) characters from English and Vietnamese documents
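import re

# Whitelist: digits plus English and Vietnamese letters (both cases).
CHARACTERS = "0123456789aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍị" \
             "ỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ"
# The backslash is escaped explicitly; a bare "\|" is an invalid escape sequence.
PUNCTUATION = ".,?!@%~`#$^&*()-_+=[]{}\\|:;\"'<>/"
ALL_CHARS = CHARACTERS + PUNCTUATION
# Matches any character that is neither a space nor in the whitelist.
WORD_NORMALIZER = re.compile(r"[^ {}]".format(re.escape(ALL_CHARS)))


def remove_unk_char(text):
    """Replace every character outside the whitelist with a space."""
    return WORD_NORMALIZER.sub(' ', text)


def strip_space(text):
    """Trim the text and collapse whitespace runs into single spaces."""
    return re.sub(r'\s+', ' ', text.strip())


def format_text(text, remove_threshold=0.2):
    """Clean text; return "" if more than remove_threshold of it was removed."""
    text = strip_space(text)
    root_len = len(text)
    if root_len == 0:  # avoid ZeroDivisionError on empty or whitespace-only input
        return ""
    text = remove_unk_char(text)
    text = strip_space(text)
    if 1 - len(text) / root_len > remove_threshold:
        return ""
    return text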
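
The threshold check in format_text may be easier to follow with concrete numbers. The sketch below is a minimal illustration, not part of the gist: it uses a reduced, hypothetical whitelist (`ALLOWED`) and a hypothetical helper (`removed_ratio`) to show how the share of removed characters is computed and compared against the default threshold of 0.2.

```python
import re

# Hypothetical, reduced whitelist for illustration only; the gist's ALL_CHARS
# also covers digits, the full Vietnamese alphabet, and more punctuation.
ALLOWED = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.,đĐ"
UNKNOWN = re.compile(r"[^ {}]".format(re.escape(ALLOWED)))


def removed_ratio(text):
    """Return the fraction of characters lost after cleaning (0.0 for empty input)."""
    original = re.sub(r"\s+", " ", text.strip())
    if not original:
        return 0.0
    cleaned = re.sub(r"\s+", " ", UNKNOWN.sub(" ", original).strip())
    return 1 - len(cleaned) / len(original)


THRESHOLD = 0.2  # same default as format_text's remove_threshold

for sample in ["Hello world ☺", "漢字漢字 ok"]:
    ratio = removed_ratio(sample)
    verdict = "dropped" if ratio > THRESHOLD else "kept"
    print(f"{sample!r}: removed {ratio:.2f} -> {verdict}")
```

Under these assumptions, a sentence with a single stray symbol stays below the threshold and would be kept, while a string made up mostly of characters outside the whitelist exceeds it and would be discarded entirely.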