Python version of Ruby script to preprocess tweets for use in GloVe featurization http://nlp.stanford.edu/projects/glove/
"""
preprocess-twitter.py
python preprocess-twitter.py "Some random text with #hashtags, @mentions and http://t.co/kdjfkdjf (links). :)"
Script for preprocessing tweets by Romain Paulus
with small modifications by Jeffrey Pennington
with translation to Python by Motoki Wu
Translation of Ruby script to create features for GloVe vectors for Twitter data.
http://nlp.stanford.edu/projects/glove/preprocess-twitter.rb
"""
import sys
import re

FLAGS = re.MULTILINE | re.DOTALL


def hashtag(text):
    # `text` is a regex match object for a "#..." token.
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        result = "<hashtag> {} <allcaps>".format(hashtag_body)
    else:
        # Split camel-case hashtags before each capital letter.
        result = " ".join(["<hashtag>"] + re.split(r"(?=[A-Z])", hashtag_body, flags=FLAGS))
    return result


def allcaps(text):
    # `text` is a regex match object for a run of capital letters.
    text = text.group()
    return text.lower() + " <allcaps>"


def tokenize(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`\-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"/", " / ")
    text = re_sub(r"@\w+", "<user>")
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    text = re_sub(r"<3", "<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    text = re_sub(r"#\S+", hashtag)
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")

    ## -- I just don't understand why the Ruby script adds <allcaps> to everything so I limited the selection.
    # text = re_sub(r"([^a-z0-9()<>'`\-]){2,}", allcaps)
    text = re_sub(r"([A-Z]){2,}", allcaps)

    return text.lower()


if __name__ == '__main__':
    _, text = sys.argv
    if text == "test":
        text = "I TEST alllll kinds of #hashtags and #HASHTAGS, @mentions and 3000 (http://t.co/dkfjkdf). w/ <3 :) haha!!!!!"
    tokens = tokenize(text)
    print(tokens)

@drevicko

commented Feb 20, 2017

FYI: I posted a question on the GloVe Google group about problems with the provided Ruby script.

@chenyangh

commented Apr 14, 2017

Hi, I found that the code does not work with Python 3: it raises a "ValueError: split() requires a non-empty pattern match." error. I don't really know how to fix it.

@chenyangh

commented Apr 14, 2017

I fixed it by replacing re with the regex module (from PyPI).
'import regex as re' did the magic.
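
The failing call is the `re.split` on the zero-width lookahead `(?=[A-Z])` in `hashtag()`. Since Python 3.7 the standard `re.split` accepts patterns that can match an empty string, so the original line may run unchanged on newer interpreters. For older versions, a minimal sketch that avoids the extra dependency is shown below. `split_camel_case` is a hypothetical helper name that does not appear in the gist, and it only treats ASCII capitals as word starts, as the original pattern does.

```python
import re


def split_camel_case(body):
    """Split a hashtag body before each ASCII capital letter.

    Hypothetical helper, not part of the gist. Every match consumes at
    least one character, so the result does not depend on how re.split
    handles empty matches.
    """
    # Either a capital followed by non-capitals, or a leading run of non-capitals.
    return re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", body)


# Drop-in use inside hashtag(): join the pieces after the <hashtag> tag.
print(" ".join(["<hashtag>"] + split_camel_case("WelcomeRefugees")))
print(" ".join(["<hashtag>"] + split_camel_case("hashtags")))
```

Unlike `re.split`, this does not produce a leading empty string when the body starts with a capital, so the joined result has single spaces between tokens.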

@gangeshwark

commented Nov 2, 2017

Change line 23 to

result = "<hashtag> {} <allcaps>".format(hashtag_body.lower())

Make hashtag_body lowercase; otherwise the <allcaps> tag will be added again by the code in line 57.
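
To illustrate the duplication, here is a small sketch that isolates the final all-caps substitution from `tokenize()` and applies it to the two possible outputs of `hashtag()` for `#HASHTAGS`. It assumes the format string `"<hashtag> {} <allcaps>"` from the original script. `tag_allcaps` is a hypothetical wrapper introduced only for this example.

```python
import re


def allcaps(match):
    # Same replacement as in the gist: lowercase the run and append the tag.
    return match.group().lower() + " <allcaps>"


def tag_allcaps(text):
    # Hypothetical wrapper around the last substitution in tokenize().
    return re.sub(r"([A-Z]){2,}", allcaps, text)


# What hashtag() returns for "#HASHTAGS" without and with .lower().
before = "<hashtag> {} <allcaps>".format("HASHTAGS")
after = "<hashtag> {} <allcaps>".format("HASHTAGS".lower())

print(tag_allcaps(before))  # <hashtag> hashtags <allcaps> <allcaps>
print(tag_allcaps(after))   # <hashtag> hashtags <allcaps>
```

Without the `.lower()` call the body is still uppercase when the final substitution runs, so it is tagged a second time.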

@Juancard

commented Feb 10, 2018

Line 43:
text = re_sub(r"/"," / ")

Should be moved after line 48:
text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")

Otherwise, neutral faces that contain the "/" character, such as :/ or :-/, won't be captured.

Anyway, thank you for this code; it saved me a lot of time.
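
The effect of the ordering can be checked in isolation. The sketch below copies the two patterns involved from the script and applies them in both orders. `slash_first` and `face_first` are hypothetical names used only for this comparison.

```python
import re

# Pattern pieces copied from tokenize().
EYES = r"[8:=;]"
NOSE = r"['`\-]?"
NEUTRAL = r"{}{}[\/|l*]".format(EYES, NOSE)


def slash_first(text):
    # Order used in the script: pad slashes, then look for neutral faces.
    text = re.sub(r"/", " / ", text)
    return re.sub(NEUTRAL, "<neutralface>", text)


def face_first(text):
    # Suggested order: look for neutral faces, then pad remaining slashes.
    text = re.sub(NEUTRAL, "<neutralface>", text)
    return re.sub(r"/", " / ", text)


print(repr(slash_first("meh :/")))  # 'meh : / '
print(repr(face_first("meh :/")))   # 'meh <neutralface>'
print(repr(face_first("8/10")))     # '<neutralface>10'
```

The last line shows a side effect worth weighing before reordering: because `8` counts as eyes, a fraction such as `8/10` also matches the neutral-face pattern once the slash is no longer padded first.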

@ppope

commented Feb 20, 2018

Forked version with edits suggested by above comments: https://gist.github.com/ppope/0ff9fa359fb850ecf74d061f3072633a

@Sleemanmunk

commented Apr 13, 2018

What's the license on this? Same as the original?
Public Domain Dedication and License http://opendatacommons.org/licenses/pddl/

@gombru

commented Jun 22, 2018

Found some bugs:

  • It doesn't add the tag to hashtags in caps.
  • It doesn't split hashtags the way the original does, e.g. #WelcomeRefugees = welcome refugees.

Here is my version:

`"""
preprocess-twitter.py
python preprocess-twitter.py "Some random text with #hashtags, @mentions and http://t.co/kdjfkdjf (links). :)"
Script for preprocessing tweets by Romain Paulus
with small modifications by Jeffrey Pennington
with translation to Python by Motoki Wu
Translation of Ruby script to create features for GloVe vectors for Twitter data.
http://nlp.stanford.edu/projects/glove/preprocess-twitter.rb
"""
import sys
import regex as re  # third-party module from PyPI, used in place of the standard re

FLAGS = re.MULTILINE | re.DOTALL


def hashtag(text):
    text = text.group()
    hashtag_body = text[1:]
    if hashtag_body.isupper():
        # Lowercase here so the final all-caps pass does not tag it again.
        result = "<hashtag> {} <allcaps>".format(hashtag_body.lower())
    else:
        # Insert a space before each capital letter to split camel case.
        result = " ".join(["<hashtag>"] + [re.sub(r"([A-Z])", r" \1", hashtag_body, flags=FLAGS)])
    return result


def allcaps(text):
    text = text.group()
    return text.lower() + " <allcaps>"


def tweet_preprocessing(text):
    # Different regex parts for smiley faces
    eyes = r"[8:=;]"
    nose = r"['`-]?"

    # function so code less repetitive
    def re_sub(pattern, repl):
        return re.sub(pattern, repl, text, flags=FLAGS)

    text = re_sub(r"https?:\/\/\S+\b|www\.(\w+\.)+\S*", "<url>")
    text = re_sub(r"@\w+", "<user>")
    text = re_sub(r"{}{}[)dD]+|[)dD]+{}{}".format(eyes, nose, nose, eyes), "<smile>")
    text = re_sub(r"{}{}p+".format(eyes, nose), "<lolface>")
    text = re_sub(r"{}{}\(+|\)+{}{}".format(eyes, nose, nose, eyes), "<sadface>")
    text = re_sub(r"{}{}[\/|l*]".format(eyes, nose), "<neutralface>")
    # Pad slashes only after the neutral faces (":/", ":-/") have been matched.
    text = re_sub(r"/", " / ")
    text = re_sub(r"<3", "<heart>")
    text = re_sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "<number>")
    text = re_sub(r"#\S+", hashtag)
    text = re_sub(r"([!?.]){2,}", r"\1 <repeat>")
    text = re_sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>")

    ## -- I just don't understand why the Ruby script adds <allcaps> to everything so I limited the selection.
    # text = re_sub(r"([^a-z0-9()<>'`\-]){2,}", allcaps)
    text = re_sub(r"([A-Z]){2,}", allcaps)

    return text.lower()


if __name__ == '__main__':
    _, text = sys.argv
    if text == "test":
        text = "I TEST alllll kinds of #hashtags and #HASHTAGS and #HashTags, @mentions and 3000 (http://t.co/dkfjkdf). w/ <3 :) haha!!!!!"
    tokens = tweet_preprocessing(text)
    print(tokens)`
