@sh78
Created June 5, 2018 22:07
#!/usr/bin/env python3
import fileinput
import re
##
# Extract URLs from file or `stdin`
#
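# Prints each URL substring found in the input, one per line. The pattern
# matches only the scheme and host: it stops at the first character that is
# not a word character, `-`, `.`, or a percent-escape, so ports, paths, and
# query strings are dropped.
#
# Call on a file or pipe to `stdin`:
#
#   extract-url.py messy_file.txt
#   ./some-script --messy | extract-url.py | sort | uniq > urls.txt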
##
regex = re.compile(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+")
string = ""
for line in fileinput.input():
    string += line  # accumulate all input lines into one string
matches = regex.findall(string)  # => list of matched URL strings
print("\n".join(matches))
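
The pattern in the script above captures only the scheme and host of each URL. If whole URLs (path, query string, fragment) are wanted, a broader pattern is one option. The following is a minimal sketch, not part of the original gist: the name `FULL_URL`, the helper `extract_full_urls`, and the choice of which trailing punctuation to trim are all assumptions made for illustration.

```python
import re

# Hypothetical, broader pattern than the gist's: scheme, then any run of
# characters that are not whitespace, quotes, or angle brackets.
FULL_URL = re.compile(r"https?://[^\s<>\"']+")


def extract_full_urls(text):
    """Return unique URLs from text in first-seen order."""
    seen = {}
    for match in FULL_URL.findall(text):
        # Trim punctuation that usually belongs to the surrounding sentence.
        # Assumption: this also trims a legitimate trailing ")" in a URL.
        url = match.rstrip(".,;:!?)")
        seen.setdefault(url, None)  # dict keys keep insertion order
    return list(seen)


if __name__ == "__main__":
    sample = "See https://example.com/a?b=1, and http://example.org."
    print("\n".join(extract_full_urls(sample)))
```

Because the helper removes duplicates itself, the `sort | uniq` step in the shell pipeline is only needed if sorted output is wanted.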