@sashaphanes
Last active December 18, 2015 01:09
Parse HTML read from STDIN (downloaded with the "links" text browser) and remove all tags except table markup
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Scrubber;
use HTML::Entities qw(decode_entities);
use Text::Unidecode qw(unidecode);

# Slurp all of STDIN into a single string.
my $HTMLinput = do { local $/; <STDIN> };
# Allow only table markup; all other tags are stripped.
my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );
# Scrub the input, decode HTML entities, then transliterate non-ASCII characters to ASCII.
my $scrubbed = $scrubber->scrub($HTMLinput);
print unidecode(decode_entities($scrubbed)), "\n";
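# Usage in bash (the script must be executable and on your PATH):
#   links -source [URL] | html.table.parser.pl
# Note: "links -source" prints the raw HTML; "links -dump" prints the
# rendered text, which contains no tags left to filter.
# Requires the "links" text browser:
#   http://www.jikos.cz/~mikulas/links/
# To install the Perl modules, use the CPAN shell:
#   sudo perl -MCPAN -e shell
#   cpan[1]> install [package.name]
#   cpan[2]> exit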