Tcl & Perl web scraper from The Practice of Programming by Brian W. Kernighan and Rob Pike
# geturl.tcl: retrieve document from URL
# input has form [http://]abc.def.com[/whatever...]
regsub "http://" $argv "" argv ;# remove http:// if present
regsub "/" $argv " " argv ;# replace leading / with blank
set so [socket [lindex $argv 0] 80] ;# make network connection
set q "/[lindex $argv 1]"
puts $so "GET $q HTTP/1.0\n\n" ;# send request
flush $so
while {[gets $so line] >= 0 && $line != ""} {} ;# skip header
puts [read $so]
# unhtml.pl: delete HTML tags
while (<>) { # collect all input into single string
$str .= $_; # by concatenating input lines
}
$str =~ s/<[^>]*>//g; # delete <...>
$str =~ s/&nbsp;/ /g; # replace &nbsp; by blank
$str =~ s/\s+/\n/g; # collapse each run of white space to one newline
print $str