@yakovsh
Last active January 17, 2016 15:59
Cleaning Up Bad HTML in Perl, Take 2
# Here is another way to clean up bad HTML with Perl and convert it to XML.
# This approach relies on HTML::TreeBuilder to parse the HTML and on the
# HTML::DOMbo module to convert the parsed tree into an XML DOM.
use strict;
use warnings;
use Encode qw(encode);
use HTML::DOMbo;
use HTML::TreeBuilder;
use XML::LibXML;
my $html_code = '';
# Parse HTML; eof() signals that the input is complete
my $builder = HTML::TreeBuilder->new();
$builder->parse($html_code);
$builder->eof();
# Convert to XML DOM
my $xml_dom = $builder->to_XML_DOM;
# Extract XML and encode UTF-8
# (assumes the DOM object provides toString, as XML::DOM nodes do)
my $xml_bytes = encode("utf-8", $xml_dom->toString);