@hoehrmann
Created April 9, 2012 00:44
Download plain text versions of public domain books from EXAMPLE Books.
#!perl -w
use strict;
use warnings;
use LWP::UserAgent;
use HTML::FormatText;
die "Usage: $0 bookid > example.txt\n" unless @ARGV == 1;
my $book = shift @ARGV;
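my %seen;                      # page tokens already fetched (loop guard)
my $next = "PP1";              # page token to fetch first
my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.0');     # send a browser-like User-Agent header
binmode STDOUT, ':utf8';
$| = 1;                        # unbuffer STDOUT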
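while (1) {
    # Stop once a page token repeats; this also ends the run after the last page.
    last if $seen{ $next }++;
    my $uri = "http://books.EXAMPLE.com/books?id=${book}&pg=${next}&hl=en&output=text";
    warn "Retrieving $next\n";
    my $response = $ua->get($uri);
    unless ($response->is_success) {
        warn "Failed to retrieve $next: " . $response->status_line . "\n";
        last;
    }
    my $content = $response->decoded_content;
    # The "next page" link carries accesskey="n"; take its pg= value.
    if ($content =~ /pg=([a-zA-Z0-9_-]+)[^"]*" accesskey="n"/) {
        $next = $1;
    } else {
        # $next stays unchanged, so the %seen check ends the loop on the next pass.
        warn "found no next page!\n";
    }
    # Strip the page chrome before and after the text container.
    $content =~ s/<body.*?<div id='flow-top-div' class='flow-top-div'>/<body><div>/s;
    $content =~ s(</div>\s*</div>\s*</div>\s*</div>\s*.*$)()s;
    my $fmt = HTML::FormatText->new;
    print $fmt->format_string($content);
    sleep 1;    # pause between requests
}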
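
The next-page lookup in the loop above can be tried without any network access. The following is a minimal sketch, not part of the original script: the helper name `next_page_token` and the sample markup are made up for illustration, and only the regular expression is taken from the script.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the page token of
# the "next page" link, or undef when the markup contains no such link.
sub next_page_token {
    my ($html) = @_;
    return $html =~ /pg=([a-zA-Z0-9_-]+)[^"]*" accesskey="n"/ ? $1 : undef;
}

# Made-up sample markup, shaped the way the regex expects it.
my $sample = '<a href="/books?id=abc&amp;pg=PA2&amp;output=text" accesskey="n">Next</a>';

my $token = next_page_token($sample);
print defined $token ? "next: $token\n" : "no next page\n";
```

Returning undef rather than leaving the old token in place would let a caller end the loop explicitly instead of relying on the `%seen` guard.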