@ramcandrews
Last active July 26, 2022 04:39
A Python regex to grab every run of Japanese text from an HTML file
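import re

rootdir = "./"  # assumed placeholder: directory that contains the HTML file

# Kanji, hiragana, katakana, plus the punctuation the original pattern allowed.
# "+" instead of "*" avoids the empty matches that "*" yields at every position.
japanese_run = re.compile(r"[一-龯ぁ-んァ-ン!:/・()ー]+")

with open(rootdir + "something in japanese.html", encoding='utf-8', errors='ignore') as reader:
    for line in reader:
        for word in japanese_run.findall(line):
            print(word)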
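
The character class above can also be written with explicit Unicode escapes, which makes the covered ranges easier to audit. The following is a minimal sketch, not the gist author's code: the function name `extract_japanese_runs`, the sample string, and the exact ranges (CJK Unified Ideographs, hiragana, katakana, and the long-vowel mark, with no punctuation) are assumptions chosen for illustration. Note that a regex like this returns unbroken runs of Japanese characters rather than dictionary words, since Japanese text is not space-delimited.

```python
import re

# Assumed ranges (not from the original gist):
#   U+4E00-U+9FFF  CJK Unified Ideographs (kanji)
#   U+3041-U+3096  hiragana
#   U+30A1-U+30FA  katakana
#   U+30FC         long-vowel mark
JAPANESE_RUN = re.compile(r"[\u4e00-\u9fff\u3041-\u3096\u30a1-\u30fa\u30fc]+")


def extract_japanese_runs(text):
    """Return every maximal run of kanji/kana characters found in text."""
    return JAPANESE_RUN.findall(text)


# Hypothetical HTML fragment used only to demonstrate the function.
sample = '<p class="note">東京タワーへ行きました。</p><a href="/x">こんにちは</a>'
for run in extract_japanese_runs(sample):
    print(run)
```

Leaving ASCII punctuation such as `/`, `:` and `(` out of the class keeps fragments of HTML tags and URLs from showing up as matches; add them back if the original behaviour is wanted.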