Last active June 22, 2017 08:04
Gist: t-redactyl/4297c8e01e5b37e8a4fdb0fea2ed93dd
Function designed to strip out all numbers (alphabetic - English only - and numeric) from a string as part of a text normalisation process.
# Function designed to strip out all numbers (alphabetic - English only - and numeric) from a string as part of a
# text normalisation process.
# Based on the text2num package (https://github.com/ghewgill/text2num) and using code from
# here (http://stackoverflow.com/questions/25346058/removing-list-of-words-from-a-string)

from string import digits

# List of number terms
nums = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven',
        'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen',
        'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'hundred',
        'thousand', 'million', 'billion', 'trillion', 'quadrillion', 'quintillion', 'sextillion',
        'septillion', 'octillion', 'nonillion', 'decillion']


def remove_numbers(s):
    """
    Removes all numbers from strings, both alphabetic (in English) and numeric. Intended to be
    part of a text normalisation process. If the number contains 'and' or commas, these are
    left behind on the assumption the text will be cleaned further to remove punctuation
    and stop-words.
    """
    # Treat hyphenated numbers ('twenty-three') as separate words, then lowercase and tokenise.
    query = s.replace('-', ' ').lower().split(' ')
    # Drop every token that is exactly one of the number terms.
    resultwords = [word for word in query if word not in nums]
    noText = ' '.join(resultwords)
    # Python 3 form: str.translate takes a table built with str.maketrans
    # (the Python 2 call was translate(None, digits) on an encoded byte string).
    # Removing digits can leave double spaces behind, so collapse them.
    noNums = noText.translate(str.maketrans('', '', digits)).replace('  ', ' ')
    return noNums
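Because the function above splits on single spaces and compares whole tokens, a number word with punctuation attached (for example 'thousand,') is not matched and stays in the text. The following is a minimal sketch of one possible variant, not part of the original gist, that matches number words on word boundaries with a regular expression. The names `NUMBER_WORDS`, `NUMBER_WORD_RE`, `DIGIT_TABLE` and `remove_numbers_regex` are hypothetical, and the word list is shortened for illustration; in practice the full `nums` list would be used.

```python
import re
from string import digits

# Hypothetical, shortened list for illustration only; substitute the full `nums` list.
NUMBER_WORDS = ['zero', 'one', 'two', 'three', 'twenty', 'hundred', 'thousand']

# \b anchors match whole words only, so 'one' is removed but 'someone' is kept.
NUMBER_WORD_RE = re.compile(r'\b(?:' + '|'.join(NUMBER_WORDS) + r')\b')

# Translation table that deletes every ASCII digit.
DIGIT_TABLE = str.maketrans('', '', digits)


def remove_numbers_regex(s):
    """Remove English number words and digits, leaving punctuation in place."""
    text = s.replace('-', ' ').lower()
    text = NUMBER_WORD_RE.sub('', text)   # drop number words, even next to punctuation
    text = text.translate(DIGIT_TABLE)    # drop numeric characters
    return ' '.join(text.split())         # collapse any run of whitespace to one space


print(remove_numbers_regex("Someone bought 3 apples and twenty-one pears"))
# someone bought apples and pears
```

Punctuation that belonged to a removed number (such as the comma in 'one thousand, three hundred') is still left behind, in keeping with the docstring's assumption that a later cleaning step removes punctuation and stop-words.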