@otov4its
Last active July 12, 2016 13:50
This gist parses 99% of Twitter-like hashtags in Unicode text.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # keep the pattern Unicode on Python 2

import re

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3

# The 're' module does not support the \p{L} Unicode letter class, so this
# simplified regex uses \w instead. Because \w also matches digits, tags
# such as #123 or #003 match too, so digit-only tags are filtered out afterwards.
# '＃' is the fullwidth number sign (U+FF03).
# See https://regex101.com/r/cK5oJ0/
HASHTAG_EXP = r'(?:^|_|[^\w&/]+)(?:#|＃)([\wÀ-ÖØ-öø-ÿ]+)'
HASHTAG_REGEX = re.compile(HASHTAG_EXP, re.UNICODE | re.IGNORECASE)


def parse_hashtags_from_text(text, to_lower_case=True):
    # Lazily yield each hashtag (without the '#'), skipping digit-only tags.
    return (
        t.lower() if to_lower_case else t
        for t in HASHTAG_REGEX.findall(text_type(text))
        if not t.isdigit()
    )
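
A short usage sketch may help show what the digit filter and the prefix rule do in practice. It assumes Python 3 and repeats the pattern in condensed form so that it runs on its own; `parse_hashtags` and `sample` are illustrative names, not part of the gist.

```python
import re

# Condensed Python 3 copy of the gist's pattern (an assumption for this sketch):
# under Python 3, \w already covers the accented Latin letters that the
# original character class lists explicitly.
HASHTAG_REGEX = re.compile(r'(?:^|_|[^\w&/]+)#(\w+)', re.UNICODE | re.IGNORECASE)


def parse_hashtags(text, to_lower_case=True):
    """Return the hashtags found in text, skipping digit-only tags."""
    return [
        t.lower() if to_lower_case else t
        for t in HASHTAG_REGEX.findall(text)
        if not t.isdigit()
    ]


sample = "#Python rocks, #питон too! #123 #2016year a&#amp; url/#frag"

print(parse_hashtags(sample))
# ['python', 'питон', '2016year']
print(parse_hashtags(sample, to_lower_case=False))
# ['Python', 'питон', '2016year']
```

The digit-only `#123` is dropped while `#2016year` is kept, and a `#` that directly follows `&` or `/` (as in HTML entities and URL fragments) is not treated as the start of a hashtag.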