@santiagobasulto
Last active May 16, 2021 10:13
Download HumbleBundle books in batch with a simple Python script.

Download HumbleBundle books

This is a quick Python script I wrote to download HumbleBundle books in batch. I bought the amazing Machine Learning by O'Reilly bundle. There were 15 books to download, with 3 different file formats per book, so I put together a quick script to download all of them in batch.

(Final Result: books downloaded)

It's a simple script; the only tricky part is extracting the generated HTML from Humble Bundle. Here is a step-by-step guide:

Step 1: Open the download page

After your purchase, open the download page:

Humble Bundle Download Page

This is what mine looks like

Step 2: Inspect element

I'm using Chrome, but Firefox also works for this. Right click anywhere on the page and click on "Inspect Element":

Once you click on Inspect, the developer window should pop up:

Step 3: Scroll all the way up

Scroll up until you see the initial <html> element. Once you've identified it, right click on it and do: Copy > Copy Element

Step 4: Paste the content

Create a new file in your favorite editor and paste the contents that you've just copied from the previous step.

Use a good name for the html file because we'll use it next. For example: humble_bundle_ml.html
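
Before running the script, it can help to confirm that the saved file really contains the download markup. The script searches for a div with the class js-all-downloads-holder, so a quick check for that class catches a bad copy early. The sketch below uses only the standard library; the names ClassFinder and looks_like_download_page are made up for this illustration and are not part of the script.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the wanted CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        class_value = dict(attrs).get('class') or ''
        if self.wanted_class in class_value.split():
            self.found = True


def looks_like_download_page(html_text):
    # 'js-all-downloads-holder' is the class the download script looks for
    finder = ClassFinder('js-all-downloads-holder')
    finder.feed(html_text)
    return finder.found
```

Reading humble_bundle_ml.html as text and passing it to looks_like_download_page should return True if the copy captured the rendered page; False suggests the element was copied before the page finished loading, or that the wrong element was copied.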

Step 5: Run the command!

Important: this script requires Python 3

Now you're ready to download those books. In your command line tool, create a virtualenv and install dependencies:

$ pip install beautifulsoup4 requests

Now you can invoke the actual command:

$ python hb_download.py humble_bundle_ml.html --epub --pdf

By default it'll download the books into a directory named books/. You can change that with the -d option.

Command Usage

❯ python hb_download.py --help
usage: hb_download.py [-h] [-d DESTINATION_DIR] [--epub] [--pdf] [--mobi]
                      html_file

Download

positional arguments:
  html_file             HTML file to download books from

optional arguments:
  -h, --help            show this help message and exit
  -d DESTINATION_DIR, --destination-dir DESTINATION_DIR
                        Directory where books will be saved
  --epub
  --pdf
  --mobi
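
One detail of the options above is worth knowing: the script declares --epub with action='store_true' and default=True, so EPUB files are requested whether or not the flag is passed. The sketch below reproduces just those three declarations to show the effect. The --no-epub flag in the second parser is hypothetical, not something the script provides; it only illustrates one way an opt-out could be expressed.

```python
import argparse

parser = argparse.ArgumentParser()
# Same flag declarations as in hb_download.py
parser.add_argument('--epub', action='store_true', default=True)
parser.add_argument('--pdf', action='store_true')
parser.add_argument('--mobi', action='store_true')

only_mobi = parser.parse_args(['--mobi'])
print(only_mobi.epub, only_mobi.pdf, only_mobi.mobi)  # prints: True False True

# Hypothetical alternative: an opt-out flag that writes False into args.epub
opt_out = argparse.ArgumentParser()
opt_out.add_argument('--no-epub', dest='epub', action='store_false')

print(opt_out.parse_args([]).epub)             # prints: True
print(opt_out.parse_args(['--no-epub']).epub)  # prints: False
```

So a command such as python hb_download.py bundle.html --mobi downloads both MOBI and EPUB files with the script as posted.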

import argparse
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def parse_download_links(html_file_content):
    soup = BeautifulSoup(html_file_content)
    external_wrapper_div = soup.find('div', class_='js-all-downloads-holder')
    wrapper_div = external_wrapper_div.find('div', class_='whitebox-redux')

    books = []
    for div in wrapper_div.find_all('div'):
        data_div = div.find('div', attrs={'data-human-name': True})
        if not data_div:
            continue
        download_div = div.find('div', class_='download-buttons')
        download_links = {}  # format label (e.g. 'PDF') -> download URL
        for button_div in download_div.find_all('div', class_='small'):
            label = button_div.find('span', class_='label').text
            download_link = button_div.find(
                'a', class_='a', attrs={'href': True})['href']
            download_links[label] = download_link

        books.append({
            'title': data_div['data-human-name'],
            'slug': data_div['data-human-name'].lower().replace(' ', '-'),
            'download_links': download_links
        })
    return books


def safe_create_dir(path):
    path.mkdir(exist_ok=True)  # no error if the directory already exists


def download_file_from_url(base_path, url, chunk_size=None):
    chunk_size = chunk_size or (4 * 1024)
    filename = urlparse(url).path.replace('/', '')  # URL path with slashes removed
    book_path = base_path / filename

    if book_path.exists():
        # book already downloaded
        return (book_path, False)

    with requests.get(url, stream=True) as resp:
        with book_path.open('wb') as fp:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:
                    fp.write(chunk)
    return (book_path, True)


def download_books(html_file_content, download_dir='./books', pdf=False, epub=False, mobi=False):
    books_parsed = parse_download_links(html_file_content)
    base_path = Path(download_dir)
    safe_create_dir(base_path)

    for book in books_parsed:
        book_base_path = base_path / book['title']
        safe_create_dir(book_base_path)

        download_urls = [
            url for should_download, url in [
                (pdf, book['download_links'].get('PDF')),
                (mobi, book['download_links'].get('MOBI')),
                (epub, book['download_links'].get('EPUB')),
            ]
            if should_download and url  # skip formats this book does not offer
        ]

        for url in download_urls:
            result, downloaded = download_file_from_url(book_base_path, url)
            if not downloaded:
                print("Skipped: ", result)
            else:
                print("Downloaded: ", result)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Download ')
    parser.add_argument(
        'html_file', type=argparse.FileType(),
        help='HTML file to download books from')
    parser.add_argument(
        '-d', '--destination-dir', type=str,
        help="Directory where books will be saved", default='books')
    parser.add_argument('--epub', action='store_true', default=True)  # always True
    parser.add_argument('--pdf', action='store_true')
    parser.add_argument('--mobi', action='store_true')

    args = parser.parse_args()
    html = args.html_file.read()
    download_books(
        html, args.destination_dir,
        pdf=args.pdf, epub=args.epub, mobi=args.mobi,
    )

@GhostofGoes

GhostofGoes commented Sep 11, 2018

Thank you for making this awesome script! I ran into an issue when downloading Automate the Boring Stuff with Python: Practical Programming for Total Beginners from the "Linux Geeks" bundle on Windows 10. An exception was raised, NotADirectoryError: [WinError 267] The directory name is invalid: 'books\\Automate the Boring Stuff with Python: Practical Programming for Total Beginners'. The issue is the : in the path.

Fix: Add .replace(':', '') at the end of line 62, with the full line being book_base_path = base_path / book['title'].replace(':', '')

Also, thank you @Susensio. Your solution fixed the other error I got!

@bjarnebuchmann

bjarnebuchmann commented Aug 6, 2020

Dear Santiago,
I would really like to get this to work; it seems like a great project with a pretty simple interface. However, I ran into problems, and the script carps with the following error:

[]$ python3 hb_download.py -d HumbleBundle_LinuxGeek --mobi HumbleBundle_LinuxGeek/bundle.html
Traceback (most recent call last):
  File "hb_download.py", line 100, in <module>
    pdf=args.pdf, epub=args.epub, mobi=args.mobi,
  File "hb_download.py", line 76, in download_books
    result, downloaded = download_file_from_url(book_base_path, url)
  File "hb_download.py", line 48, in download_file_from_url
    with requests.get(url, stream=True) as resp:
AttributeError: __exit__

Also, I got a warning about BeautifulSoup falling back to LXML as the default parser, but that can be avoided by adding a second argument "lxml" to the BeautifulSoup() call in line 10.

I know quite a bit about programming and Python, but I am not at all proficient in BeautifulSoup. For instance, I do not know exactly how your script deals with the login procedure to HumbleBundle.

I have introduced the changes suggested by @Susensio and @GhostofGoes, but that did not seem to change anything in the present case.

All help is most welcome.

EDIT:
Changing the line with requests.get(url, stream=True) as resp: (around line 50) to resp = requests.get(url, stream=True), and re-indenting the lines that follow, fixed the above problem for me. However, at this point all downloads are zero-size (empty) files. So, subdirs for books are created, and files too, but no content is actually downloaded.

I suspect that this has to do with login, but I am not sure how the login is performed with this py-script. Trying to use the URLs directly with e.g. wget just results in ERROR 403: Forbidden, so presumably the login cookie needs to be copied over from the Chrome session as well, and somehow included in the requests session.
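
Whatever the cause of the rejected requests turns out to be, part of the symptom can be explained by the script itself: it writes whatever body the server returns without checking the HTTP status, so a refused request can leave an empty or bogus file behind, and the next run then reports that file as "Skipped" because it already exists. The AttributeError: __exit__ above is what older releases of requests raise when a Response is used directly in a with statement. The sketch below is one possible guard, not the author's code: save_response is a hypothetical helper that takes an already-created response object (anything exposing raise_for_status() and iter_content(), as a requests response does), writes to a temporary .part file, and renames it only when data actually arrived.

```python
from pathlib import Path


def save_response(resp, book_path: Path, chunk_size=4 * 1024):
    """Stream resp into book_path; raise instead of leaving a bad file."""
    # On a requests response this raises requests.HTTPError for 4xx/5xx replies
    resp.raise_for_status()

    # Write to a temporary name so a failed download never looks finished
    tmp_path = book_path.with_name(book_path.name + '.part')
    written = 0
    with tmp_path.open('wb') as fp:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:
                fp.write(chunk)
                written += len(chunk)

    if written == 0:
        tmp_path.unlink()
        raise ValueError('empty response body for {}'.format(book_path.name))

    tmp_path.replace(book_path)
    return written


# Possible use inside download_file_from_url (assumes requests is installed):
#     resp = requests.get(url, stream=True)
#     save_response(resp, book_path, chunk_size)
```

This does not solve the 403 itself; it only makes the failure visible instead of producing zero-size files.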

PS: If you figure out at some point how to check the md5sums, that would be a great addition too.
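
A checksum check along those lines could look like the sketch below. It only covers the comparison step: where the expected MD5 value comes from is left to the caller, since nothing in this thread establishes whether the saved HTML exposes it. The function names md5_of_file and verify_md5 are made up for this illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with Path(path).open('rb') as fp:
        # iter() with a sentinel keeps calling fp.read() until it returns b''
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    # Assumption: expected_md5 is a hex string supplied by the caller
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch after download would be a reasonable signal to delete the file and retry, rather than leaving it to be skipped on the next run.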

@hsantos78

Hello, I just tried the .replace(':', '') fix suggested by @GhostofGoes above and was able to download books, except that one book fails with the following error:


Downloaded:  C:\Python\HB\books\20200911M\Illustrated Guide to Home Chemistry Experiments\illustratedguidetohomechemistryexperiments.epub
Downloaded:  C:\Python\HB\books\20200911M\Atmospheric Monitoring with Arduino\atmosphericmonitoringwitharduino.epub
Traceback (most recent call last):
  File "hb_download.py", line 97, in <module>
    download_books(
  File "hb_download.py", line 75, in download_books
    result, downloaded = download_file_from_url(book_base_path, url)
  File "hb_download.py", line 49, in download_file_from_url
    with book_path.open('wb') as fp:
  File "C:\Python\Python38\lib\pathlib.py", line 1218, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "C:\Python\Python38\lib\pathlib.py", line 1074, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Python\\HB\\books\\20200911M\\Make Inventing a Better Mousetrap \\inventingabettermousetrap.epub'

I am currently using Win 10
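
The directory name in that traceback ends with a space ('...Better Mousetrap \\'), which is a plausible cause: Windows does not preserve trailing spaces or dots on directory names, so the directory that gets created and the path that is later opened may not match. A somewhat broader cleanup than removing the colon alone could look like the sketch below. The helper name safe_dir_name is hypothetical and not part of the script.

```python
import re

# Characters that Windows does not allow in file or directory names
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*]')


def safe_dir_name(title):
    """Turn a book title into a directory name that is also valid on Windows."""
    cleaned = _INVALID_CHARS.sub('', title)
    # Collapse runs of whitespace left behind by removed characters
    cleaned = re.sub(r'\s+', ' ', cleaned)
    # Drop leading whitespace and any trailing spaces or dots
    return cleaned.strip().rstrip('. ')
```

In download_books it would be applied where the per-book directory is built, for example book_base_path = base_path / safe_dir_name(book['title']).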
