@Syncrossus
Last active October 22, 2021 13:55
A collection of useful regular expressions, in PCRE and Python syntax
# Garbage matcher
# matches any run of |, <, >, +, *, ^, #, =, ~, chains of two or more hyphens or underscores,
# and runs of two or more backslashes, underscores or slashes.
# this is to identify patterns like ++<===> ######## <---->^^ which serve no purpose but to decorate the text
/(\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|\/){2,}/
# Garbage matcher 2
# matches any single character that isn't a letter, digit, space or basic punctuation (, . ' ? !).
# this is typically useful for cleaning up emojis
/.(?<!([a-zA-Z0-9]|,|\.|'| |\?|\!))/
# layout hyphenation matcher, typically matches hyphens used as bullet points
/(^\ *-\ )/
# e-mail address matcher
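# the theoretical character limit for top-level domains is 63 characters
# as written, the pattern only matches lowercase letters; use the i flag to match uppercase as well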
/([a-z]|[0-9]|\.|\+|-)+@([a-z]|[0-9]|\.|-)+\.[a-z]{2,63}/
# phone number matcher
# see this Stack Overflow post explaining how many digits to account for in a phone number
# https://stackoverflow.com/a/4729239/2980717
/\(?\+?(\(?[0-9]\)?(-|\ |\.)?){6,30}[0-9]/
# URL matcher
/https?\:\/\/(www\.)?([A-Za-z]|[0-9]|\.|-|%)+\.[A-Za-z]{2,63}(\/([A-Za-z]|[0-9]|-|\.|_|#)+)*\/?(\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)?/
# https?\:\/\/(www\.)? # scheme and optional www.
# ([A-Za-z]|[0-9]|\.|-|%)+ # address
# \.[A-Za-z]{2,63} # top-level domain
# (\/([A-Za-z]|[0-9]|-|\.|_|#)+)*\/? # /stuff/between/slashes/
# (\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)? # request details after ?
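The piece-by-piece breakdown of the URL matcher can also be kept inside the pattern itself. The following is a minimal sketch in Python, assuming `re.VERBOSE` is acceptable for your use; the name `PAT_URL_VERBOSE` is illustrative and not part of the original collection. Letters are written as the explicit ASCII class `[A-Za-z]`, and `#` is escaped because verbose mode would otherwise treat it as the start of a comment.

```python
import re

# Hypothetical name; a verbose-mode layout of the URL matcher with one
# commented line per component.
PAT_URL_VERBOSE = re.compile(
    r"""
    https?://(www\.)?                            # scheme and optional www.
    ([A-Za-z]|[0-9]|\.|-|%)+                     # address
    \.[A-Za-z]{2,63}                             # top-level domain
    (/([A-Za-z]|[0-9]|-|\.|_|\#)+)*/?            # /stuff/between/slashes/
    (\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|\#|:|\+)+)?  # request details after ?
    """,
    re.VERBOSE,
)

match = PAT_URL_VERBOSE.search("see https://www.example.com/a/b_c?x=1&y=2 now")
print(match.group(0))  # https://www.example.com/a/b_c?x=1&y=2
```

Whitespace and comments inside the pattern are ignored in verbose mode, so the matching behavior is that of the same components written on a single line.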
# PAT at the beginning of each regex name stands for Pattern
PAT_GARBAGE = r"((\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|/){2,})"
PAT_GARBAGE_2 = r".(?<!([a-zA-Z0-9]|,|\.|'| |\?|\!))"
PAT_BULLET_HYPHENS = r"(^\s*-\ )"
PAT_EMAIL = r"(([A-Za-z]|[0-9]|\.|\+|-|_)+@([A-Za-z]|[0-9]|\.|-)+\.[A-Za-z]{2,63})"
PAT_PHONE = r"(\(?\+?(\(?[0-9]\)?(-|\ |\.)?){6,30}[0-9])"
PAT_URL = r"(https?\://(www\.)?([A-Za-z]|[0-9]|\.|-|%)+\.[A-Za-z]{2,63}(/([A-Za-z]|[0-9]|-|\.|_|#)+)*/?(\?([A-Za-z]|[0-9]|\.|-|%|=|&|_|#|\:|\+)+)?)"
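As a rough illustration of how the Python patterns might be combined for text cleanup, here is a small sketch. The helper `clean_line` and the `EMAIL` placeholder are hypothetical and not part of the original collection; the pattern strings are repeated so the block runs on its own, with letters written as `[A-Za-z]`. E-mail addresses are masked first, since the garbage matcher would otherwise strip a `+` from the local part.

```python
import re

# Pattern strings repeated here so the example is self-contained.
PAT_GARBAGE = r"((\||(--+)|(__+)|<|>|\+|\*|\^|#|=|~)+|(\\|_|/){2,})"
PAT_BULLET_HYPHENS = r"(^\s*-\ )"
PAT_EMAIL = r"(([A-Za-z]|[0-9]|\.|\+|-|_)+@([A-Za-z]|[0-9]|\.|-)+\.[A-Za-z]{2,63})"


def clean_line(line):
    """Hypothetical helper: mask e-mails, then drop bullets and decoration."""
    line = re.sub(PAT_EMAIL, "EMAIL", line)       # mask addresses first
    line = re.sub(PAT_BULLET_HYPHENS, "", line)   # leading "- " bullet
    line = re.sub(PAT_GARBAGE, "", line)          # decorative runs like ====>
    return line.strip()


print(clean_line("- contact: john.doe+news@example.com ====>"))  # contact: EMAIL
```

The order of the substitutions is a design choice for this sketch only; other pipelines may prefer to extract addresses rather than mask them.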

Syncrossus commented Jul 3, 2018

This code is released under the WTFPL.