@bbarker
Created March 31, 2016 21:48
Find regex matches in a glob of HTML files and print all unique matches
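#!/usr/bin/perl
# Scan every file matching a glob, collect each run of three or more
# capital letters (a rough acronym pattern), and print the unique matches.
use warnings;
use strict;
use File::Slurp;    # provides read_file
use HTML::Strip;    # strips markup so text inside tags is not matched

my $hs = HTML::Strip->new();
my %acronyms;       # keys are the unique matches seen so far

#my @pages = (glob q("*.aspx"), glob q("*.html"));
my @pages = glob q("*.aspx");

foreach my $page (@pages) {
    print "reading $page :::::::\n";
    my $raw_html  = read_file($page);
    my $page_text = $hs->parse($raw_html);
    $hs->eof;    # reset the parser before the next file

    # Match against the stripped text; the original matched $raw_html,
    # which left $page_text unused.
    my @page_acronyms = ( $page_text =~ /[A-Z]{3,}/g );
    foreach my $acronym (@page_acronyms) {
        $acronyms{$acronym} = 1;
    }
}

# Print each unique match once, in sorted order.
foreach my $acronym (sort keys %acronyms) {
    print "$acronym\n";
}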