@datawrangling
Created June 18, 2009 23:45
#!/usr/bin/env python
# encoding: utf-8
"""
parse_categories.py
Convert the SQL INSERT format of the Wikipedia categorylinks dump into
a tab-delimited text file of (page_id, category) pairs.

Usage:
cat enwiki-20090306-categorylinks.sql | ./parse_categories.py > categorylinks.txt
(Or run via Hadoop Streaming)
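Example Hadoop Streaming invocation (a sketch only; the streaming jar path,
HDFS input/output paths, and exact flags are assumptions that vary with your
Hadoop version and cluster layout):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -jobconf mapred.reduce.tasks=0 \
    -input /wikipedia/enwiki-20090306-categorylinks.sql \
    -output /wikipedia/categorylinks \
    -mapper parse_categories.py \
    -file parse_categories.py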
Created by Peter Skomoroch on 2009-06-18.
Copyright (c) 2009 Data Wrangling LLC. All rights reserved.
"""
import sys, re

# Matches a full INSERT statement and captures everything between VALUES and the trailing ';'
insert_regex = re.compile(r"INSERT INTO `categorylinks` VALUES (.*);")
# Matches a single row: page_id, 'category', 'sortkey', timestamp
row_regex = re.compile(r"(.*),'(.*)','(.*)',(.*)")
for line in sys.stdin:
    match = insert_regex.match(line.strip())
    if match is not None:
        # Captured group looks like "(row1),(row2),...,(rowN)"
        data = match.group(1)
        # Strip the outer parens and split on the "),(" separators between rows
        rows = data[1:-1].split("),(")
        for row in rows:
            row_match = row_regex.match(row)
            if row_match is not None:
                # >>> row_match.groups()
                # ('305', 'People_of_the_Trojan_War', 'Achilles', '20090301193903')
                # keep only page_id and the category name
                page_id, category_url = row_match.groups()[0], row_match.groups()[1]
                sys.stdout.write('%s\t%s\n' % (page_id, category_url))
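As a quick local sanity check (using the sample row from the comment above; the only fiddly part is escaping the backticks for the shell), piping a single INSERT statement through the script should print one tab-separated line:

echo "INSERT INTO \`categorylinks\` VALUES (305,'People_of_the_Trojan_War','Achilles',20090301193903);" | ./parse_categories.py
# expected output:
# 305	People_of_the_Trojan_War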