@duttashi
Last active May 25, 2021 07:13
common text data preprocessing regex implementations
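import re

# Assumes the text data is loaded in a pandas DataFrame called df,
# with the raw strings in a column named "text".
# Each step below uses a regular expression to clean the text.

# Remove Twitter handles (@username)
df["text"] = df["text"].apply(lambda x: re.sub(r"@\S+", "", x))
# Remove hashtags
df["text"] = df["text"].apply(lambda x: re.sub(r"\B#\S+", "", x))
# Remove URLs
df["text"] = df["text"].apply(lambda x: re.sub(r"http\S+", "", x))
# Remove special characters: keep only runs of word characters
# (letters, digits, underscore) and join them with single spaces
df["text"] = df["text"].apply(lambda x: " ".join(re.findall(r"\w+", x)))
# Collapse runs of whitespace into a single space
# (the join above already guarantees this; kept in case that step is skipped)
df["text"] = df["text"].apply(lambda x: re.sub(r"\s+", " ", x))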
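The steps above can also be folded into one function, so the column is traversed once instead of five times. This is only a minimal sketch: the function name `clean_text` and the pattern constant names are hypothetical and not part of the original gist, and the patterns and their order are copied from the steps above.

```python
import re

# Hypothetical names; the patterns and their order mirror the steps above.
HANDLE_RE = re.compile(r"@\S+")      # Twitter handles
HASHTAG_RE = re.compile(r"\B#\S+")   # hashtags
URL_RE = re.compile(r"http\S+")      # URLs starting with http/https
WORD_RE = re.compile(r"\w+")         # runs of letters, digits, underscore


def clean_text(text: str) -> str:
    """Apply the cleaning steps in sequence to a single string."""
    text = HANDLE_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    # Keeping only word-character runs drops punctuation and, because the
    # tokens are joined with one space, also normalises whitespace.
    return " ".join(WORD_RE.findall(text))


sample = "@alice check https://example.com/a?b=1 #nlp   it's GREAT!!!"
print(clean_text(sample))  # check it s GREAT
```

With the DataFrame assumed above, this would be applied as `df["text"] = df["text"].apply(clean_text)`. Note that the apostrophe in "it's" is treated as a special character, so the word is split into "it" and "s"; whether that is acceptable depends on the downstream task.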