@kshirsagarsiddharth
Created December 27, 2022 09:40
import spacy

# Load the small English model (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")


def clean_social_media_data(text):
    """Return the text lemmatized, with stop words, non-alphabetic tokens, and short words removed."""
    # Process the text
    doc = nlp(text)
    # Extract the lemmas and remove stop words
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    # Remove punctuation and any token that is not purely alphabetic
    tokens = [token for token in tokens if token.isalpha()]
    # Remove words that are shorter than three characters
    tokens = [token for token in tokens if len(token) > 2]
    # Join the tokens into a single string
    clean_text = ' '.join(tokens)
    return clean_text


# Test the function
text = 'I had a great time at the party last night! 😎 #party #friends @siddharth @sid'
print(clean_social_media_data(text))
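
The function above relies on the spaCy tokenizer and the `isalpha()` filter to deal with hashtags, mentions, and links, so whether a handle such as `@sid` survives depends on how the tokenizer happens to split it. A minimal sketch of an explicit pre-processing step is shown below. It uses only the standard `re` module, and the helper name `strip_social_markup` and its three patterns are hypothetical additions, not part of the original gist. It assumes you want to drop mentions and URLs entirely and keep the word behind each hashtag.

```python
import re

# Hypothetical helper (not in the original gist): strips social-media markup
# before the text is handed to a cleaning function such as clean_social_media_data.
URL_RE = re.compile(r"https?://\S+")   # http:// or https:// links
MENTION_RE = re.compile(r"@\w+")       # @username handles
HASHTAG_RE = re.compile(r"#(\w+)")     # #topic, captured without the '#'


def strip_social_markup(text):
    """Remove URLs and @mentions, and keep only the word part of hashtags."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    # Collapse the runs of whitespace left behind by the substitutions
    return " ".join(text.split())


sample = 'I had a great time at the party last night! 😎 #party #friends @siddharth @sid'
print(strip_social_markup(sample))
# I had a great time at the party last night! 😎 party friends
```

The result could then be passed to `clean_social_media_data`. Note that the simple mention pattern would also match the domain part of an email address, so it may need tightening for text that contains emails.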