@NielsMinssen
Last active July 19, 2022 11:28
# Tokenization
# Requires the NLTK stop word corpus: nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# `text` is assumed to already hold the raw French text as a string.
words = word_tokenize(text, language="french", preserve_line=True)

# Remove punctuation: keep only purely alphabetic tokens, lowercased
words_no_punc = [w.lower() for w in words if w.isalpha()]

# Standard French stop words (stored under a new name so the
# `stopwords` module is not overwritten)
french_stopwords = set(stopwords.words("french"))
# Optionally add extra stop words; lowercase forms are enough because
# every token was lowercased above
french_stopwords.update(["monsieur", "madame", "mme"])

# Build the list of cleaned words by dropping the stop words
clean_words = [w for w in words_no_punc if w not in french_stopwords]
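
The two filtering steps above (dropping punctuation, then dropping stop words) can be sketched as one reusable function. This is only a minimal illustration, not part of the original gist: the name `clean_tokens`, its parameters, and the sample tokens are hypothetical, and it takes an already tokenized list so that it runs without NLTK installed.

```python
def clean_tokens(tokens, stop_words):
    """Lowercase alphabetic tokens and drop those found in stop_words.

    Hypothetical helper: `tokens` stands in for the output of a tokenizer,
    `stop_words` for any iterable of words to exclude.
    """
    # Lowercase the stop words once so the comparison is case-insensitive
    stop_set = {s.lower() for s in stop_words}
    return [
        t.lower()
        for t in tokens
        if t.isalpha() and t.lower() not in stop_set
    ]


# Made-up sample input standing in for tokenizer output
sample_tokens = ["Bonjour", ",", "Monsieur", "le", "Président", "!"]
sample_stop_words = ["le", "monsieur"]
cleaned = clean_tokens(sample_tokens, sample_stop_words)
print(cleaned)
```

Because tokens are lowercased before the stop word check, a single lowercase entry such as "monsieur" also removes "Monsieur".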