@beatobongco
Last active May 13, 2022 08:20
Replace non-English sentences
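import re
import math
import enchant

# Requires the pyenchant package and an installed en_US dictionary.
d = enchant.Dict("en_US")


def replace_noneng_sents(text: str, replace_with_spaces=True,
                         split_regex=r"[\.\。\!\?\?]",
                         clean_regex=r"[^a-zA-Z\']",
                         threshold=0.5,
                         debug=False) -> str:
    """Remove sentences that contain too few English words.

    A sentence is kept when its count of English words is at least
    ceil(number_of_words * threshold). If replace_with_spaces is True, each
    removed sentence is replaced with spaces so the output keeps the length
    of the input. If it is False, removed sentences are dropped entirely."""
    out = ""
    last_kept = False
    for sent in re.split(split_regex, text):
        splitted = sent.split()
        engwords = 0
        for word in splitted:
            # Strip everything except ASCII letters and apostrophes before the lookup.
            _word = re.sub(clean_regex, "", word)
            if _word and d.check(_word.lower()):
                engwords += 1
        last_kept = engwords >= math.ceil(len(splitted) * threshold)
        if last_kept:
            # TODO: it will always add a period instead of the punctuation used to split
            out += sent + "."
        else:
            if debug:
                print("Removed:", sent.strip())
            if replace_with_spaces:
                # The extra space stands in for the delimiter consumed by re.split.
                out += " " * (len(sent) + 1)
    # re.split yields one more segment than there are delimiters, so the final
    # segment contributed one character too many. When sentences are dropped
    # and the final segment was dropped too, there is nothing extra to strip.
    if replace_with_spaces or last_kept:
        out = out[:-1]
    if replace_with_spaces:
        assert len(text) == len(out)
    return out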
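
The TODO in the function notes that every kept sentence ends up with a period, whatever punctuation it originally had. One possible way around this, sketched here and not part of the gist, is to put a capturing group in the split pattern so that `re.split` returns the delimiters as well. The names `replace_keep_punct`, `is_english` and `TOY_ENGLISH` are hypothetical; the tiny word set only stands in for `d.check` so the sketch runs without pyenchant.

```python
import math
import re

# Hypothetical stand-in for the enchant dictionary; for real use, pass
# something like `lambda w: d.check(w.lower())` as `is_english`.
TOY_ENGLISH = {"this", "is", "a", "test", "good", "morning"}


def is_english(word: str) -> bool:
    return word.lower() in TOY_ENGLISH


def replace_keep_punct(text, is_english=is_english, threshold=0.5,
                       split_regex=r"([.。!?])",
                       clean_regex=r"[^a-zA-Z']"):
    # With a capturing group, re.split returns
    # [sentence, delimiter, sentence, delimiter, ..., sentence].
    parts = re.split(split_regex, text)
    out = []
    for i in range(0, len(parts), 2):
        sent = parts[i]
        delim = parts[i + 1] if i + 1 < len(parts) else ""
        words = sent.split()
        engwords = 0
        for word in words:
            cleaned = re.sub(clean_regex, "", word)
            if cleaned and is_english(cleaned):
                engwords += 1
        if engwords >= math.ceil(len(words) * threshold):
            # Keep the sentence together with its original delimiter.
            out.append(sent + delim)
        else:
            # Blank out the sentence and its delimiter, preserving length.
            out.append(" " * (len(sent) + len(delim)))
    return "".join(out)


print(repr(replace_keep_punct("This is a test! Hola mundo? Good morning.")))
```

Because each delimiter is carried along with its sentence, this variant needs no trailing-character fix-up, and the output length matches the input by construction.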

beatobongco commented May 13, 2022

Sample output

text = """English: "I'm Goin' Down" is a rock song written and performed by American singer Bruce Springsteen (pictured). The song was recorded with the E Street Band on May 12–13, 1982, and was released on August 27, 1985, by Columbia Records as the sixth single from his 1984 album Born in the U.S.A. Although Springsteen had changing ideas about the songs to put on the album, "I'm Goin' Down" was ultimately selected for inclusion. The recording is based on an energetic band performance.
Slovenian: Slovenska abeceda, kot jo poznamo danes, ima 25 latiničnih črk. Posebnosti? Slovenska abeceda nima črk q, x, y ali w, zato pa ima č, š in ž. Znak nad črko se imenuje strešica. Izgovorjava ni težka. Ž se izgovarja kot črka s v angleški besedi 'pleasure', š kot sh v 'show' in č kot ch v 'cherry'. 
English:English mystic Julian of Norwich (statue pictured) recovered from a severe illness during which she experienced a series of intense visions of Christ, which she later described in the first known English-language book written by a woman.
German: Leben ist kein Ponyhof. Drück mir die Daumen! Abwarten und Tee trinken. Ich glaub ich spinne.
Japanese: どういたしまして。おはようございます。
Chinese: 不太好。你叫什么名字?很高兴认识你。 
English: Pro tip: you usually use this last phrase when saying goodbye to someone after meeting them for the first time, rather than immediately after being introduced.
Spanish: ¿Qué haces en tu tiempo libre? Me gusta ir… ¿En qué trabajas? ¿Qué haces en tu tiempo libre? ¿Cuáles son tus pasatiempos?
English: Getting to know others and talking about your interests are the bread and butter of learning a language. So you have to know how to express your hobbies!
"""
replace_noneng_sents(text, debug=True, replace_with_spaces=True)

Debug statements:

Removed: Slovenian: Slovenska abeceda, kot jo poznamo danes, ima 25 latiničnih črk
Removed: Posebnosti
Removed: Slovenska abeceda nima črk q, x, y ali w, zato pa ima č, š in ž
Removed: Znak nad črko se imenuje strešica
Removed: Izgovorjava ni težka
Removed: Ž se izgovarja kot črka s v angleški besedi 'pleasure', š kot sh v 'show' in č kot ch v 'cherry'
Removed: German: Leben ist kein Ponyhof
Removed: Drück mir die Daumen
Removed: Abwarten und Tee trinken
Removed: Ich glaub ich spinne
Removed: Japanese: どういたしまして
Removed: おはようございます
Removed: Chinese: 不太好
Removed: 你叫什么名字
Removed: 很高兴认识你
Removed: Spanish: ¿Qué haces en tu tiempo libre
Removed: Me gusta ir… ¿En qué trabajas
Removed: ¿Qué haces en tu tiempo libre
Removed: ¿Cuáles son tus pasatiempos
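
The removals above follow from the rule `engwords >= math.ceil(len(splitted) * threshold)`. As a small illustration of the arithmetic only (the helper name `min_english_words` is hypothetical, and no dictionary lookup is done here), this computes how many English words a segment needs in order to be kept at the default threshold of 0.5:

```python
import math


def min_english_words(n_words: int, threshold: float = 0.5) -> int:
    # Smallest number of English words that satisfies the keep condition.
    return math.ceil(n_words * threshold)


for sent in ["Posebnosti", "German: Leben ist kein Ponyhof", ""]:
    n = len(sent.split())
    print(repr(sent), "words:", n, "needs:", min_english_words(n))
```

A one-word segment needs that one word to be English, a five-word segment needs three, and an empty segment needs zero, so empty or whitespace-only segments are always kept.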

Output text:

English: "I'm Goin' Down" is a rock song written and performed by American singer Bruce Springsteen (pictured). The song was recorded with the E Street Band on May 12–13, 1982, and was released on August 27, 1985, by Columbia Records as the sixth single from his 1984 album Born in the U.S.A. Although Springsteen had changing ideas about the songs to put on the album, "I'm Goin' Down" was ultimately selected for inclusion. The recording is based on an energetic band performance.                                                                                                                                                                                                                                                                                                                    
English:English mystic Julian of Norwich (statue pictured) recovered from a severe illness during which she experienced a series of intense visions of Christ, which she later described in the first known English-language book written by a woman.                                                                                                                                                                 
English: Pro tip: you usually use this last phrase when saying goodbye to someone after meeting them for the first time, rather than immediately after being introduced.                                                                                                                                   
English: Getting to know others and talking about your interests are the bread and butter of learning a language. So you have to know how to express your hobbies.
