@t-redactyl
Last active June 22, 2017 08:04
Function that strips all numbers, both spelled out (English only) and numeric, from a string as part of a text normalisation process.
# Function that strips all numbers, both spelled out (English only) and numeric, from a string
# as part of a text normalisation process.
# Based on the text2num package (https://github.com/ghewgill/text2num) and using code from
# this Stack Overflow answer (http://stackoverflow.com/questions/25346058/removing-list-of-words-from-a-string)
from string import digits
# List of number terms
nums = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven',
        'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
        'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'hundred',
        'thousand', 'million', 'billion', 'trillion', 'quadrillion', 'quintillion', 'sextillion',
        'septillion', 'octillion', 'nonillion', 'decillion']
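def remove_numbers(s):
    """
    Removes all numbers from a string, both spelled out (in English) and numeric. Intended to be
    part of a text normalisation process. If the number contains 'and' or commas, these are
    left behind on the assumption the text will be cleaned further to remove punctuation
    and stop-words. The result is lower-cased and hyphens are replaced with spaces.
    """
    # Split hyphenated numbers such as 'twenty-one', lower-case, and tokenise on whitespace
    query = s.replace('-', ' ').lower().split()
    # Drop every token that exactly matches a spelled-out number term
    resultwords = [word for word in query if word not in nums]
    no_text = ' '.join(resultwords)
    # Delete digit characters; str.translate(None, digits) only works on Python 2 strings,
    # so build a deletion table with str.maketrans for Python 3
    no_nums = no_text.translate(str.maketrans('', '', digits))
    # Collapse the whitespace left behind by removed digits
    return ' '.join(no_nums.split())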
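
The function above compares whole whitespace-separated tokens against the list, so a number term with punctuation attached (for example 'thousand,') does not match and is kept. The block below is a minimal sketch of one way to handle that case with a regular expression and word boundaries. It is an assumption-laden variant, not part of the original gist: the names `NUMBER_WORDS`, `_WORD_RE`, `_DIGIT_RE`, and `remove_numbers_regex` are hypothetical, and the word list is repeated only so that the sketch runs on its own under Python 3.

```python
import re

# Same number terms as the list above, repeated so this sketch is self-contained.
NUMBER_WORDS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine',
                'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen',
                'seventeen', 'eighteen', 'nineteen', 'twenty', 'thirty', 'forty', 'fifty',
                'sixty', 'seventy', 'eighty', 'ninety', 'hundred', 'thousand', 'million',
                'billion', 'trillion', 'quadrillion', 'quintillion', 'sextillion',
                'septillion', 'octillion', 'nonillion', 'decillion']

# \b matches only at word boundaries, so 'one' is removed but 'someone' is left alone,
# and 'thousand,' is matched even though a comma follows it.
_WORD_RE = re.compile(r'\b(?:' + '|'.join(NUMBER_WORDS) + r')\b')
# ASCII digits only, mirroring string.digits in the original function.
_DIGIT_RE = re.compile(r'[0-9]')


def remove_numbers_regex(s):
    """Hypothetical regex-based variant of remove_numbers (illustrative only)."""
    text = s.replace('-', ' ').lower()
    text = _WORD_RE.sub(' ', text)    # replace spelled-out number terms with a space
    text = _DIGIT_RE.sub('', text)    # delete digit characters
    return ' '.join(text.split())     # collapse the leftover whitespace


if __name__ == '__main__':
    print(remove_numbers_regex('I have twenty-one apples and 3 pears'))
    print(remove_numbers_regex('One thousand, two hundred and five people'))
```

As in the original docstring, 'and' and commas are left behind for a later punctuation and stop-word pass; the only intended difference is that punctuation next to a number term no longer shields it from removal.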