davidfraser / OfficeCleaner.py
Created March 19, 2013 08:24
Clean up XHTML by removing extraneous markup, in particular the markup generated by copying and pasting out of Microsoft Office products
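To make concrete what "extraneous" Office markup tends to look like, here is a minimal sketch that uses only the standard library. It is not the approach taken in OfficeCleaner.py, which builds on lxml and cssutils; the function name `strip_office_markup` and the regular expressions are hypothetical and assume well-behaved, double-quoted attributes.

```python
import re

# Hypothetical patterns, chosen for illustration only (not from OfficeCleaner.py).
# Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
_CONDITIONAL_COMMENT = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)
# Office-namespaced tags such as <o:p>, </o:p>, <v:shape ...>, <w:WordDocument>
_OFFICE_TAG = re.compile(r"</?[ovw]:[^>]*>", re.IGNORECASE)
# class="MsoNormal" and similar (double-quoted values only)
_MSO_CLASS = re.compile(r'\s+class="Mso[^"]*"', re.IGNORECASE)
# Any double-quoted style attribute; its declarations are filtered below
_STYLE_ATTR = re.compile(r'\s+style="([^"]*)"', re.IGNORECASE)


def _filter_style(match):
    """Keep only the declarations that do not start with the mso- prefix."""
    kept = []
    for declaration in match.group(1).split(";"):
        declaration = declaration.strip()
        if declaration and not declaration.lower().startswith("mso-"):
            kept.append(declaration)
    # Drop the whole attribute when nothing is left in it.
    return ' style="%s"' % "; ".join(kept) if kept else ""


def strip_office_markup(html):
    """Remove common Office paste residue from an HTML fragment (rough sketch)."""
    html = _CONDITIONAL_COMMENT.sub("", html)
    html = _OFFICE_TAG.sub("", html)
    html = _MSO_CLASS.sub("", html)
    html = _STYLE_ATTR.sub(_filter_style, html)
    return html


sample = (
    '<p class="MsoNormal" style="mso-margin-top-alt:auto; color:red">Hi'
    "<o:p></o:p></p><!--[if gte mso 9]><xml>junk</xml><![endif]-->"
)
print(strip_office_markup(sample))  # <p style="color:red">Hi</p>
```

Regular expressions are fragile on real-world HTML (single-quoted attributes, semicolons inside `url(...)` values, nested comments), which is one reason to work on a parsed tree instead, as the code in this gist does.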
import cssutils
from xml.sax import saxutils
from lxml.html import tostring, fromstring, clean
from lxml import etree
import logging
class Cleaner(clean.Cleaner):
    def clean_html(self, html):
        if not isinstance(html, unicode):