@Navjotbians
Created May 22, 2021 16:50
Patterns present in the dataset
# Order matters: each specific pattern must come before the general one that would consume it.
SUBSTITUTIONS = [
    (r"\d+", ""),              # Delete digits
    (r"can't", "cannot "),     # can't -> cannot (before n't, which would yield "ca not")
    (r"n't", " not "),         # n't -> not
    (r"what's", "what is "),   # what's -> what is (before the generic 's rule)
    (r"'scuse", " excuse "),   # 'scuse -> excuse (before the generic 's rule)
    (r"'s", " "),              # Replace 's with a space
    (r"'ve", " have "),        # 've -> have
    (r"'re", " are "),         # 're -> are
    (r"'d", " would "),        # 'd -> would
    (r"'ll", " will "),        # 'll -> will
    (r"i'm", "i am"),          # i'm -> i am
    (r" m ", " am "),          # Stray " m " -> " am "
    (r"\W", " "),              # Replace non-word characters with a space
    (r"\s+", " "),             # Collapse whitespace runs (last, because earlier rules add spaces)
]
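
The gist defines only the pattern list, so the following is a minimal sketch of how such a list might be applied, not the author's own code. The helper `apply_substitutions` and the shortened `DEMO_SUBSTITUTIONS` list are hypothetical names introduced for illustration. Lowercasing the input first is an assumption, based on the patterns being written in lowercase (`i'm`, `what's`).

```python
import re

# Hypothetical helper, not part of the original gist.
def apply_substitutions(text, substitutions):
    """Lowercase the text (assumption), then apply each (pattern, replacement) pair in order."""
    text = text.lower()
    for pattern, replacement in substitutions:
        text = re.sub(pattern, replacement, text)
    return text.strip()

# Shortened demo list, ordered from specific to general.
DEMO_SUBSTITUTIONS = [
    (r"can't", "cannot "),  # must precede n't
    (r"n't", " not "),
    (r"\W", " "),           # punctuation -> space
    (r"\s+", " "),          # collapse whitespace last
]

print(apply_substitutions("I can't go, but she didn't mind!", DEMO_SUBSTITUTIONS))
# prints: i cannot go but she did not mind
```

Because the pairs are applied sequentially, a general rule placed too early changes what later rules see: running `n't` before `can't` turns "can't" into "ca not", and the `can't` rule never matches.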