@mcandre
Forked from sashaphanes/html.table.parser.pl
Last active December 18, 2015 01:59
#!/usr/bin/perl -ws
#
# just-the-tables.pl
#
# Summary
#
# Strips most HTML formatting, leaving tables.
#
# Example
#
# ./just-the-tables.pl <URL>
#
# The output can be saved to a file,
# or rendered with a command line browser.
#
# ./just-the-tables.pl <URL> > output.html
# links output.html
use strict;
use HTML::Scrubber;
use HTML::Entities qw(decode_entities);
use Text::Unidecode qw(unidecode);
use LWP::Simple qw(get);
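# Expect exactly one argument (the URL); otherwise print this script as usage.
die `more $0` unless @ARGV == 1;
# get() returns undef when the fetch fails.
my $HTML = get($ARGV[0]);
die "Could not fetch $ARGV[0]\n" unless defined $HTML;
# allow takes an array reference of the tag names to keep.
my $scrubber = HTML::Scrubber->new( allow => [ qw(table tr td) ] );
my $scrubbed = $scrubber->scrub($HTML);
print unidecode(decode_entities($scrubbed)) . "\n";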