@rmehta
Last active August 29, 2015 14:00
# Usage:
# Make sure all source files (.html or .xml) are in the current folder
# Run `python convert.py`
#
# Output:
# Converted rows in `out.csv`
# Ignored rows in `ignored.csv`
import csv
import os
import re

cols = ['Voter`s Slip', 'AC Number', 'Part Number', 'Section Number', 'SerialNo. in Part ',
        'Name in English', ' Name in Hindi', 'Relation FirstName in English', 'Relation FirstName in Hindi',
        'Gender', 'ID Card No.']

# first row of out.csv is the header
out, ignored = [cols], []

for source in os.listdir("."):
    # only convert files with an .html or .xml extension
    if "." in source and source.rsplit(".", 1)[1] in ("html", "xml"):
        with open(source, "r") as f:
            print("Converting " + source)
            html = f.read()
        # each data row starts with this <tr> tag
        for row in html.split('<tr bgcolor="#D0E0FB">'):
            # collect every run of text that sits between two tags
            data = re.findall(r'>([^<]+)<', row)
            # rows whose first cell is "Click to view" are kept; the rest are ignored
            # (the `data and` guard avoids an IndexError on chunks with no text)
            if data and data[0] == "Click to view":
                out.append(data[:len(cols)])
            else:
                ignored.append(data)

# spit out files
with open("out.csv", "w") as f:
    csv.writer(f).writerows(out)
with open("ignored.csv", "w") as f:
    csv.writer(f).writerows(ignored)