@dansimau
Created November 13, 2012 09:44
Recursive grep-like search for extracting URLs from a bunch of files
import os
import re
import sys
# Crazy URL regexp from Gruber
# http://daringfireball.net/2010/07/improved_regex_for_matching_urls
r = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?]))')
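# grep -r: walk the directory tree given as the first argument
for parent, dnames, fnames in os.walk(sys.argv[1]):
    for fname in fnames:
        filename = os.path.join(parent, fname)
        if os.path.isfile(filename):
            # errors='replace' (Python 3) keeps binary files such as .pyc
            # from raising UnicodeDecodeError while being scanned
            with open(filename, errors='replace') as f:
                # enumerate() supplies 1-based line numbers
                for c, line in enumerate(f, 1):
                    # Only the first URL on each line is reported
                    match = r.search(line)
                    if match:
                        # <file>:<line>:<match>
                        print('%s:%s:%s' % (filename, c, match.group(0)))
                        # <match>
                        #print(match.group(0))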
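
The script above reports at most one URL per line, because `r.search()` stops at the first match. A minimal sketch of a variant that reports every match on a line is shown here. The function name `iter_url_matches` and the simplified `URL_RE` pattern are assumptions made for this illustration and are not part of the gist; for real use, pass the Gruber regexp compiled as `r` above instead.

```python
import os
import re

# Simplified stand-in pattern (an assumption for this sketch); it only
# recognises http:// and https:// URLs, unlike the Gruber regexp.
URL_RE = re.compile(r'(?i)\bhttps?://[^\s()<>"\']+')


def iter_url_matches(root, pattern=URL_RE):
    """Yield (filename, line_number, url) for every URL found under root."""
    for parent, _dnames, fnames in os.walk(root):
        for fname in sorted(fnames):
            filename = os.path.join(parent, fname)
            if not os.path.isfile(filename):
                continue
            # errors='replace' avoids decode errors on non-text files
            with open(filename, errors='replace') as f:
                for lineno, line in enumerate(f, 1):
                    # finditer() yields every match on the line,
                    # not only the first one
                    for match in pattern.finditer(line):
                        yield filename, lineno, match.group(0)
```

Calling `iter_url_matches(sys.argv[1])` and printing each tuple as `'%s:%s:%s'` would give the same `<file>:<line>:<match>` layout as the original script, with one output line per URL.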
@dansimau
Author

E.g.:

$ python ./grepurls.py /home/dan/git/test
www.gravatar.com/avatar/%s?%s
www.gravatar.com/avatar/%s?%stst
http://flask.pocoo.org/snippets/12/
http://flask.pocoo.org/snippets/12/
http://www.gravatar.com/
https://www.gravatar.com/
https://en.gravatar.com/emails/
https://maps.google.co.uk/maps?q=
http://maps.google.com/maps/api/staticmap?center=
http://www.apache.org/licenses/LICENSE-2.0
http://www.apache.org/licenses/LICENSE-2.0
jquery.org/license
http://www.apache.org/licenses/LICENSE-2.0
http://fortawesome.github.com/Font-Awesome/
http://creativecommons.org/licenses/by/3.0/
http://fortawesome.github.com/Font-Awesome
http://twitter.com/fortaweso_me
http://lemonwi.se
$

@dansimau
Author

Example rev 2:

$ python ./grepurls.py /home/dan/git/test
/home/dan/git/test/__init__.py:65:www.gravatar.com/avatar/%s?%s
/home/dan/git/test/__init__.pyc:21:www.gravatar.com/avatar/%s?%st
/home/dan/git/test/blueprints/staff/templates/person.jinja:113:https://maps.google.co.uk/maps?q=
/home/dan/git/test/blueprints/staff/templates/person.jinja:114:http://maps.google.com/maps/api/staticmap?center=
/home/dan/git/test/blueprints/staff/templates/person.jinja:45:http://www.gravatar.com/
/home/dan/git/test/blueprints/staff/templates/person.jinja:46:https://www.gravatar.com/
/home/dan/git/test/blueprints/staff/templates/person.jinja:48:https://en.gravatar.com/emails/
/home/dan/git/test/helpers.py:6:http://flask.pocoo.org/snippets/12/
/home/dan/git/test/helpers.pyc:3:http://flask.pocoo.org/snippets/12/
/home/dan/git/test/static/assets/css/combined.1f4fcb67.css:1:http://www.apache.org/licenses/LICENSE-2.0
/home/dan/git/test/static/assets/css/combined.4fc648c5.css:1:http://www.apache.org/licenses/LICENSE-2.0
/home/dan/git/test/static/assets/js/combined.7b7ecd78.js:1:jquery.org/license
/home/dan/git/test/static/css/bootstrap.css:6:http://www.apache.org/licenses/LICENSE-2.0
/home/dan/git/test/static/css/font-awesome.css:10:http://creativecommons.org/licenses/by/3.0/
/home/dan/git/test/static/css/font-awesome.css:11:http://fortawesome.github.com/Font-Awesome
/home/dan/git/test/static/css/font-awesome.css:20:http://twitter.com/fortaweso_me
/home/dan/git/test/static/css/font-awesome.css:21:http://lemonwi.se
/home/dan/git/test/static/css/font-awesome.css:5:http://fortawesome.github.com/Font-Awesome/
$
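
The rev 2 output above includes hits from compiled `.pyc` files, which usually duplicate the hits from the matching `.py` source. A minimal sketch of one way to skip such files follows, assuming that a NUL byte in the first block of a file marks it as binary. This is a heuristic rather than a guarantee, and `looks_binary` is a hypothetical helper name that is not part of the gist.

```python
def looks_binary(filename, blocksize=1024):
    """Return True if the first block of the file contains a NUL byte."""
    with open(filename, 'rb') as f:
        return b'\0' in f.read(blocksize)
```

In the walk loop, the check `if os.path.isfile(filename)` could be extended to `if os.path.isfile(filename) and not looks_binary(filename)`. Text files that legitimately contain NUL bytes (for example, UTF-16 encoded ones) would also be skipped under this assumption.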
