Last active
July 15, 2017 19:07
Downloading entire website using wget
wget -l 1 --recursive --page-requisites --html-extension --domains elastic.co --no-parent https://www.elastic.co/guide/en/logstash/current/
-l 1 --recursive > follow links, but only one level deep from the starting page
--page-requisites > also download the CSS, images, and other files needed to render each page
--html-extension > save downloaded pages with an .html extension
--domains > follow links only on this domain
--no-parent > don't follow links above the directory https://www.elastic.co/guide/en/logstash/current/
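To reuse the same options for another documentation tree, the command can be built from the start URL. This is only a sketch: mirror_cmd is a hypothetical helper name, and deriving the --domains value by stripping a leading "www." is an assumption that will not suit every site. It prints the command instead of running it, so it can be checked first.

```shell
# Sketch: print the wget command for a given start URL (nothing is downloaded).
# mirror_cmd is a hypothetical helper, not part of wget.
mirror_cmd() (
  url=$1
  host=${url#*://}       # drop the scheme, e.g. "https://"
  host=${host%%/*}       # drop the path, leaving the host name
  domain=${host#www.}    # assumption: the site domain is the host without "www."
  printf '%s\n' "wget -l 1 --recursive --page-requisites --html-extension --domains $domain --no-parent $url"
)

mirror_cmd https://www.elastic.co/guide/en/logstash/current/
```

Once the printed line looks right, it can be pasted into the terminal or piped to sh.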
Converting the website to PDF
find ../www.elastic.co/guide/en/logstash/current/ -iname "*.html" -exec electron-pdf {} {}.pdf \;
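Converting many pages can take a while, so it may help to make the step resumable. The sketch below keeps the same output naming as the find command (page.html becomes page.html.pdf) and skips pages that already have a non-empty PDF. SITE_DIR and needs_conversion are hypothetical names introduced for this example, and the loop assumes file names without newlines.

```shell
# Sketch: resumable version of the conversion step.
# SITE_DIR and needs_conversion are hypothetical names for this example.
SITE_DIR=${SITE_DIR:-../www.elastic.co/guide/en/logstash/current}

# Succeeds when "<page>.pdf" is missing or empty, i.e. the page still needs converting.
needs_conversion() {
  [ ! -s "$1.pdf" ]
}

# Run only when electron-pdf is installed and the mirrored directory exists.
if command -v electron-pdf >/dev/null 2>&1 && [ -d "$SITE_DIR" ]; then
  find "$SITE_DIR" -iname '*.html' | while IFS= read -r page; do
    if needs_conversion "$page"; then
      electron-pdf "$page" "$page.pdf" < /dev/null
    fi
  done
fi
```

Re-running the script after an interruption then converts only the pages that are still missing.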
After all files are converted to PDF, go to the website and find the order of the pages.
In this case, the following one-liner, run in the browser console, printed the pages in order:
$x('//*[@id="guide"]/div/div/div[1]/div[2]/div[2]/ul/li/ul/li/span/a/@href').forEach(function(k){console.log(k.nodeValue)})
Next, combine the PDFs in the order retrieved from the previous command
pdftk 1.pdf 2.pdf 3.pdf .... n.pdf cat output finalbook.pdf
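Typing the file list by hand gets tedious for a long book. One possible shortcut, sketched here under assumptions: the hrefs printed in the browser console have been saved to a text file, one per line, and each href is a page name relative to the mirrored directory (for example introduction.html), so its PDF is that name plus ".pdf". combine_in_order and the PDFTK variable are hypothetical names; PDFTK exists only so the command can be previewed with echo.

```shell
# Sketch: build the pdftk file list from a saved page order.
# combine_in_order and PDFTK are hypothetical names for this example.
# Usage: combine_in_order <pdf-dir> <order-file> <output.pdf>
combine_in_order() {
  dir=$1 order=$2 out=$3
  set --                                   # reuse the positional parameters as the file list
  while IFS= read -r href; do
    # skip blank lines; otherwise append "<dir>/<href>.pdf"
    [ -n "$href" ] && set -- "$@" "$dir/$href.pdf"
  done < "$order"
  ${PDFTK:-pdftk} "$@" cat output "$out"
}

# Preview the command without running pdftk:
#   PDFTK=echo combine_in_order ../www.elastic.co/guide/en/logstash/current order.txt finalbook.pdf
```

Because the list is passed as separate arguments, page names containing spaces are handled correctly.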
Finally, compress the final PDF with Ghostscript
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressedlogstashbook.pdf finalbook.pdf
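The /screen preset gives the smallest file but the lowest image quality; Ghostscript also accepts /ebook, /printer, and /prepress for progressively higher quality. A small wrapper makes it easy to try them. This is a sketch: compress_pdf and the GS variable are hypothetical names, and GS exists only so the command can be previewed with echo.

```shell
# Sketch: compress a PDF with a selectable Ghostscript preset.
# compress_pdf and GS are hypothetical names for this example.
# Usage: compress_pdf <input.pdf> <output.pdf> [screen|ebook|printer|prepress]
compress_pdf() {
  in=$1 out=$2 preset=${3:-screen}
  ${GS:-gs} -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
    "-dPDFSETTINGS=/$preset" -dNOPAUSE -dQUIET -dBATCH \
    "-sOutputFile=$out" "$in"
}

# Preview the command without running Ghostscript:
#   GS=echo compress_pdf input.pdf output.pdf ebook
```

Comparing the output sizes with ls -lh shows which preset gives an acceptable trade-off.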