@John-Appleseed
Last active March 19, 2022 05:35
Web archive webpages with browsertrix-crawler
# web.save <website_url>: crawl a website with browsertrix-crawler (run via Docker) and save it as a WACZ archive.
# - ArchiveWeb.page install link: https://chrome.google.com/webstore/detail/webrecorder-archivewebpag/fpeoodllldobpkbkabpblcfaogecpndd
# - webrecorder/browsertrix-crawler: run a high-fidelity browser-based crawler in a single Docker container - https://github.com/webrecorder/browsertrix-crawler#features
web.save(){
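# Require a URL; without one the crawler has nothing to fetch.
if [ -z "$1" ]; then
  echo "Usage: web.save <website_url>" >&2
  return 1
fi
local website_url="$1"
# Collection name: the URL without its scheme, with "." and "/" replaced by "_", plus a timestamp.
local archive_collection_name
archive_collection_name="$(echo "$website_url" | awk -F:// '{print $2}' | tr './' '__')-$(date "+%Y-%m-%d_%Hh%Mm%Ss")"
local website_save_path="$HOME/Downloads/crawls"
local screencastport="9037"
mkdir -p "$website_save_path"
# Build the arguments as an array so a URL containing spaces or glob characters is passed intact.
local browsertrix_crawler_args=(crawl --screencastPort "$screencastport" --generateWACZ --workers 2 --text --url "$website_url" --collection "$archive_collection_name")
echo "Saving website, ${website_url}, to ${website_save_path}."
echo "Visit http://localhost:$screencastport to watch the crawl progress."
# The container listens on the port given to --screencastPort, so publish the same port on both sides.
echo sudo docker run -v "$website_save_path":/crawls/ -p "$screencastport:$screencastport" -it webrecorder/browsertrix-crawler "${browsertrix_crawler_args[@]}"
time sudo docker run -v "$website_save_path":/crawls/ -p "$screencastport:$screencastport" -it webrecorder/browsertrix-crawler "${browsertrix_crawler_args[@]}"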
echo "browsertrix-crawler run finished."
ls "$website_save_path/collections/$archive_collection_name/$archive_collection_name.wacz"
echo "Open the WACZ file with the ArchiveWeb.page web extension."
echo "ArchiveWeb.page install link: https://chrome.google.com/webstore/detail/webrecorder-archivewebpag/fpeoodllldobpkbkabpblcfaogecpndd?hl=en"
echo "In Chrome, go to ArchiveWeb.page > Import Archive and select the WACZ file:"
echo "$website_save_path/collections/$archive_collection_name/$archive_collection_name.wacz"
# Show the disk usage of every saved collection.
echo du -sch "$website_save_path/collections/*"
du -sch "$website_save_path/collections/"*
}
# ---