@niklasb
Created January 15, 2012 20:56
Extract TLDs from Chrome's effective_tld_names.dat and dump them as a JavaScript regex
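# Python 2 script. Pass the path to effective_tld_names.dat as the first argument.
import re
import sys
import codecs


def js_str_chunks(l, n):
    """Split l into quoted string literals of at most n escaped characters."""
    tmp = []
    size = 0
    for x in l:
        # repr(u'x') is u'x': subtract the u prefix and both quotes to get the
        # escaped width of one character (1 for ASCII, more for escapes).
        char_size = len(repr(x)) - 3
        if size + char_size > n:
            # [1:] strips the leading u, leaving a quoted literal for JavaScript.
            yield repr(''.join(tmp))[1:]
            tmp = []
            size = 0
        tmp.append(x)
        size += char_size
    yield repr(''.join(tmp))[1:]


# Keep every non-blank line that is not a // comment.
with codecs.open(sys.argv[1], 'r', 'utf-8') as f:
    tlds = [line.strip() for line in f
            if line.strip() and not line.startswith('//')]

regex = '\\.(%s)$' % '|'.join(re.escape(t) for t in tlds)
print 'var tldRegex = new RegExp(\n%s\n);' \
    % '+\n'.join(' %s' % line for line in js_str_chunks(regex, 70))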
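The script above relies on Python 2's `repr()` of unicode strings to produce the escapes, which does not carry over to Python 3, where `repr()` leaves non-ASCII characters unescaped. Below is a minimal Python 3 sketch of the same idea, assuming `json.dumps` is an acceptable substitute for producing the string literals: it emits double-quoted strings with `\uXXXX` escapes, which JavaScript also accepts. This is a swapped-in technique, not the original method. The function name `build_tld_regex_js` and the `width` parameter are hypothetical names chosen for illustration. Like the original, it treats every non-comment line as a literal suffix.

```python
import json
import re


def js_str_chunks(text, width):
    """Yield double-quoted string literals whose escaped bodies fit in `width`."""
    buf, size = [], 0
    for ch in text:
        # json.dumps escapes non-ASCII as \uXXXX; drop the two quotes to measure.
        escaped_len = len(json.dumps(ch)) - 2
        if buf and size + escaped_len > width:
            yield json.dumps(''.join(buf))
            buf, size = [], 0
        buf.append(ch)
        size += escaped_len
    yield json.dumps(''.join(buf))


def build_tld_regex_js(lines, width=70):
    """Build the `var tldRegex = ...` statement from an iterable of lines."""
    # Same filter as the original: skip blank lines and // comments.
    tlds = [ln.strip() for ln in lines
            if ln.strip() and not ln.startswith('//')]
    pattern = r'\.(%s)$' % '|'.join(re.escape(t) for t in tlds)
    body = '+\n'.join('  %s' % chunk for chunk in js_str_chunks(pattern, width))
    return 'var tldRegex = new RegExp(\n%s\n);' % body
```

To use it on the real file, open it with `open(path, encoding='utf-8')` and pass the file object to `build_tld_regex_js`, then print the result. Splitting in the middle of a regex escape such as `\.` is harmless, because JavaScript concatenates the literals before `RegExp` sees the pattern.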