Last active
January 17, 2016 15:59
Cleaning Up Bad HTML in Perl, Take 2
# Here is another way to clean up bad HTML with Perl and convert it to XML.
# This approach relies on the HTML::DOMbo module to do the actual conversion
# between HTML and XML, and on HTML::TreeBuilder for parsing.
use HTML::DOMbo;
use HTML::TreeBuilder;
use XML::LibXML;
use Encode qw(encode);
my $html_code = '';
# Parse HTML; parse() returns the parser object, and eof() marks the end of input
my $builder = HTML::TreeBuilder->new();
my $xml_source = $builder->parse($html_code);
$xml_source->eof();
# Convert to XML DOM
my $xml_source1 = $xml_source->to_XML_DOM;
# Extract XML and encode UTF-8
# (assumes the DOM object provides toString, as XML::DOM documents do)
my $xml_source2 = encode("utf-8", $xml_source1->toString);