Created

Embed URL

HTTPS clone URL

SSH clone URL

You can clone with HTTPS or SSH.

Download Gist

Strip headers/footers from Project Gutenberg texts

View stripgutenberg.pl
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
#!/usr/bin/perl
 
# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010
 
use strict;
 
my $debug = 0;
 
my $state = 'beginning';
my $print = 0;
my $printed = 0;
 
while (1) {
$_ = <>;
 
last unless $_;
# strip UTF-8 BOM
if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
$_ = substr($_, 3);
}
 
if ($state eq 'beginning') {
if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
$state = 'normal pg header';
$debug && print "state: beginning -> normal pg header\n";
$print = 0;
} elsif (/^$/) {
$state = 'beginning blanks';
$debug && print "state: beginning -> beginning blanks\n";
} else {
die "unrecognized beginning: $_";
}
} elsif ($state eq 'normal pg header') {
if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
$state = 'end of normal header';
$debug && print "state: normal pg header -> end of normal pg header\n";
} else {
# body of normal pg header
}
} elsif ($state eq 'end of normal header') {
if (/^(Produced by|Transcribed from)/) {
$state = 'post header';
$debug && print "state: end of normal pg header -> post header\n";
} elsif (/^$/) {
# blank lines
} else {
$state = 'etext body';
$debug && print "state: end of normal header -> etext body\n";
$print = 1;
}
} elsif ($state eq 'post header') {
if (/^$/) {
$state = 'blanks after post header';
$debug && print "state: post header -> blanks after post header\n";
} else {
# multiline Produced / Transcribed
}
} elsif ($state eq 'blanks after post header') {
if (/^$/) {
# more blank lines
} else {
$state = 'etext body';
$debug && print "state: blanks after post header -> etext body\n";
$print = 1;
}
} elsif ($state eq 'beginning blanks') {
if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
$state = 'header include';
$debug && print "state: beginning blanks -> header include\n";
} elsif (/^Title: /) {
$state = 'aus header';
$debug && print "state: beginning blanks -> aus header\n";
} elsif (/^$/) {
# more blanks
} else {
die "unexpected stuff after beginning blanks: $_";
}
} elsif ($state eq 'header include') {
if (/^$/) {
# blanks after header include
} else {
$state = 'aus header';
$debug && print "state: header include -> aus header\n";
}
} elsif ($state eq 'aus header') {
if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
$state = 'end of aus header';
$debug && print "state: aus header -> end of aus header\n";
} elsif (/^A Project Gutenberg of Australia eBook$/) {
$state = 'end of aus header';
$debug && print "state: aus header -> end of aus header\n";
}
} elsif ($state eq 'end of aus header') {
if (/^((Title|Author): .*)?$/) {
# title, author, or blank line
} else {
$state = 'etext body';
$debug && print "state: end of aus header -> etext body\n";
$print = 1;
}
} elsif ($state eq 'etext body') {
# here's the stuff
if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
$state = 'footer';
$debug && print "state: etext body -> footer\n";
$print = 0;
} elsif (/^(\*\*\* ?)?end of (the )?project/i) {
$state = 'footer';
$debug && print "state: etext body -> footer\n";
$print = 0;
}
} elsif ($state eq 'footer') {
# nothing more of interest
} else {
die "unknown state '$state'";
}
 
if ($print) {
print;
++$printed;
} else {
$debug && print "## $_";
}
}

thanks a lot for this script, it is so useful!

Owner

Glad you liked it! Be sure to let me know if it doesn't work on some files or if you make any changes. I know it doesn't work on the KJV Bible or Webster's dictionary but those are fairly special anyway.

Thanks a lot!! Haven't used it yet but I have a feeling it will help my group with our DL project.

I've been trying to make the script so it receives input and gives output for many text files in the same directory. Is this something that you think the script could do without many modifications? Any ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.