@hubgit
Created August 18, 2010 15:21
strict purification of HTML using HTMLPurifier
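<?php
require_once 'HTMLPurifier.auto.php'; // load HTML Purifier; assumes its library directory is on the include path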
$url = 'http://en.wikipedia.org/wiki/1,1,1-Trichloroethane'; // example
$config = HTMLPurifier_Config::createDefault();
$config->set('URI.Base', $url); // base URL for resolving relative URLs; a <base> element in the source HTML is not consulted
$config->set('URI.MakeAbsolute', true); // make all URLs absolute using the base URL set above
$config->set('AutoFormat.RemoveEmpty', true); // remove empty elements
$config->set('HTML.Doctype', 'XHTML 1.0 Strict'); // use the XHTML 1.0 Strict element set; output is a well-formed fragment, not a complete document
$config->set('HTML.AllowedElements', array('p', 'div', 'a', 'br', 'table', 'thead', 'tbody', 'tr', 'th', 'td', 'ul', 'ol', 'li', 'b', 'i')); // whitelist of elements; tags not listed here are removed
$config->set('HTML.AllowedAttributes', array('a.href')); // remove all attributes except a.href
$config->set('CSS.AllowedProperties', array()); // remove all CSS
$purifier = new HTMLPurifier($config);
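$html = file_get_contents($url); // fetch the page; returns false on failure
if ($html === false) {
    die("Could not fetch $url\n");
}
$html = $purifier->purify($html); // $html now holds the purified markup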
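To see what this configuration does without fetching a live page, here is a minimal sketch that runs a hand-written fragment through the same settings. The helper name `strict_purifier`, the sample markup, and the include path are assumptions for illustration, not part of the original gist. The element whitelist is shortened for brevity, and definition caching is turned off so the sketch does not need a writable cache directory.

```php
<?php
// Assumption: HTML Purifier is installed and this path resolves to its autoloader.
require_once 'HTMLPurifier.auto.php';

// Hypothetical helper (not in the gist): build a strict purifier for a given base URL.
function strict_purifier($baseUrl)
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('Cache.DefinitionImpl', null);        // disable definition caching for this sketch
    $config->set('URI.Base', $baseUrl);                // base for resolving relative URLs
    $config->set('URI.MakeAbsolute', true);            // rewrite relative URLs as absolute
    $config->set('AutoFormat.RemoveEmpty', true);      // drop elements left empty
    $config->set('HTML.Doctype', 'XHTML 1.0 Strict');
    $config->set('HTML.AllowedElements', array('p', 'a', 'b', 'i')); // shortened whitelist
    $config->set('HTML.AllowedAttributes', array('a.href'));         // only href on links survives
    $config->set('CSS.AllowedProperties', array());                  // no CSS
    return new HTMLPurifier($config);
}

// Sample input: a styled paragraph, a relative link, a script, and an empty span.
$dirty = '<p style="color:red" class="lead">See '
       . '<a href="/wiki/Chloroform" title="x">chloroform</a>.</p>'
       . '<script>alert(1)</script><span></span>';

$clean = strict_purifier('http://en.wikipedia.org/wiki/1,1,1-Trichloroethane')->purify($dirty);
echo $clean, "\n";
```

The printed fragment should keep the paragraph and the link, with the `style`, `class` and `title` attributes and the `script` element gone, and the link's `href` resolved against the base URL.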