@kjam
Created January 20, 2017 17:25
Tcl & Perl web scraper from The Practice of Programming by Brian W. Kernighan and Rob Pike
# geturl.tcl: retrieve document from URL
# input has form [http://]abc.def.com[/whatever...]
regsub "http://" $argv "" argv ;# remove http:// if present
regsub "/" $argv " " argv ;# replace first / with blank, splitting host from path
set so [socket [lindex $argv 0] 80] ;# make network connection
set q "/[lindex $argv 1]"
puts $so "GET $q HTTP/1.0\n\n" ;# send request
flush $so
while {[gets $so line] >= 0 && $line != ""} {} ;# skip header
puts [read $so] ;# read and print the rest of the reply

# unhtml.pl: delete HTML tags
while (<>) { # collect all input into single string
    $str .= $_; # by concatenating input lines
}
$str =~ s/<[^>]*>//g; # delete <...>
$str =~ s/&nbsp;/ /g; # replace &nbsp; by blank
$str =~ s/\s+/\n/g; # compress white space
print $str;