@santiagobasulto
Last active May 16, 2021 10:13
Download HumbleBundle books in batch with a simple Python script.

Download HumbleBundle books

This is a quick Python script I wrote to download Humble Bundle books in batch. I bought the amazing Machine Learning by O'Reilly bundle, which came with 15 books to download in 3 different file formats each, so I put together a script to grab all of them at once.

(Final Result: books downloaded)

It's a simple script; the only tricky part is extracting the rendered HTML from the Humble Bundle page. Here is a step-by-step guide:

Step 1: Open the download page

After your purchase, open the download page:

Humble Bundle Download Page

This is how mine looks:

Step 2: Inspect element

I'm using Chrome, but Firefox also works for this. Right click anywhere on the page and click on "Inspect Element":

screenshot at 12-46-50

Once you click on Inspect, the developer window should pop up:

screenshot at 12-47-42

Step 3: Scroll all the way up

Scroll up until you see the initial <html> element. Once you've identified it, right click on it and select Copy > Copy Element:

screenshot at 12-49-14

Step 4: Paste the content

Create a new file in your favorite editor and paste the contents that you've just copied from the previous step.

screenshot at 12-50-33

Give the HTML file a descriptive name because we'll use it in the next step. For example: humble_bundle_ml.html
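Optionally, you can sanity-check the saved file before running anything: the script's parser looks for a div with the class js-all-downloads-holder, so a plain substring search is a quick (if rough) way to confirm you copied the right element. This helper is an illustration, not part of the script:

```python
def has_download_holder(html_text):
    # Rough sanity check: the parser relies on a div with this class,
    # so the copied HTML should contain it somewhere.
    return 'js-all-downloads-holder' in html_text
```

For example, has_download_holder(open('humble_bundle_ml.html').read()) should return True if Step 3 worked.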

Step 5: Run the command!

Important: this script requires Python 3

Now you're ready to download those books. In your terminal, create and activate a virtualenv, then install the dependencies:

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install beautifulsoup4 requests

Now you can invoke the actual command:

$ python hb_download.py humble_bundle_ml.html --epub --pdf

By default it'll download the books into a directory named books/. You can change that with the -d option.

Command Usage

❯ python hb_download.py --help
usage: hb_download.py [-h] [-d DESTINATION_DIR] [--epub] [--pdf] [--mobi]
                      html_file

Download

positional arguments:
  html_file             HTML file to download books from

optional arguments:
  -h, --help            show this help message and exit
  -d DESTINATION_DIR, --destination-dir DESTINATION_DIR
                        Directory where books will be saved
  --epub
  --pdf
  --mobi

hb_download.py

import argparse
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def parse_download_links(html_file_content):
    soup = BeautifulSoup(html_file_content)
    external_wrapper_div = soup.find('div', class_='js-all-downloads-holder')
    wrapper_div = external_wrapper_div.find('div', class_='whitebox-redux')
    books = []
    for div in wrapper_div.find_all('div'):
        data_div = div.find('div', attrs={'data-human-name': True})
        if not data_div:
            continue
        download_div = div.find('div', class_='download-buttons')
        download_links = {}
        for button_div in download_div.find_all('div', class_='small'):
            label = button_div.find('span', class_='label').text
            download_link = button_div.find(
                'a', class_='a', attrs={'href': True})['href']
            download_links[label] = download_link
        books.append({
            'title': data_div['data-human-name'],
            'slug': data_div['data-human-name'].lower().replace(' ', '-'),
            'download_links': download_links
        })
    return books


def safe_create_dir(path):
    path.mkdir(exist_ok=True)


def download_file_from_url(base_path, url, chunk_size=None):
    chunk_size = chunk_size or (4 * 1024)
    filename = urlparse(url).path.replace('/', '')
    book_path = base_path / filename
    if book_path.exists():
        # book already downloaded
        return (book_path, False)
    with requests.get(url, stream=True) as resp:
        with book_path.open('wb') as fp:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:
                    fp.write(chunk)
    return (book_path, True)


def download_books(html_file_content, download_dir='./books', pdf=False, epub=False, mobi=False):
    books_parsed = parse_download_links(html_file_content)
    base_path = Path(download_dir)
    safe_create_dir(base_path)
    for book in books_parsed:
        book_base_path = base_path / book['title']
        safe_create_dir(book_base_path)
        download_urls = [
            url for should_download, url in [
                (pdf, book['download_links'].get('PDF')),
                (mobi, book['download_links'].get('MOBI')),
                (epub, book['download_links'].get('EPUB')),
            ]
            if should_download
        ]
        for url in download_urls:
            result, downloaded = download_file_from_url(book_base_path, url)
            if not downloaded:
                print("Skipped: ", result)
            else:
                print("Downloaded: ", result)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Download ')
    parser.add_argument(
        'html_file', type=argparse.FileType(),
        help='HTML file to download books from')
    parser.add_argument(
        '-d', '--destination-dir', type=str,
        help="Directory where books will be saved", default='books')
    parser.add_argument('--epub', action='store_true', default=True)
    parser.add_argument('--pdf', action='store_true')
    parser.add_argument('--mobi', action='store_true')
    args = parser.parse_args()

    html = args.html_file.read()
    download_books(
        html, args.destination_dir,
        pdf=args.pdf, epub=args.epub, mobi=args.mobi,
    )
Susensio commented Aug 29, 2018

Great work dude! Although the script is failing on "An Introduction to Machine Learning Interpretability" because of the missing PDF and MOBI formats. Adding if url is not None: after line 73 solves the issue.
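The same fix can also be expressed as a tiny filter applied to download_urls before the loop in download_books(); available_urls is a hypothetical helper name, not something from the script:

```python
def available_urls(download_urls):
    # Drop missing formats so download_file_from_url() never sees None.
    return [url for url in download_urls if url is not None]
```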

GhostofGoes commented Sep 11, 2018

Thank you for making this awesome script! I ran into an issue when downloading Automate the Boring Stuff with Python: Practical Programming for Total Beginners from the "Linux Geeks" bundle on Windows 10. An exception was raised, NotADirectoryError: [WinError 267] The directory name is invalid: 'books\\Automate the Boring Stuff with Python: Practical Programming for Total Beginners'. The issue is the : in the path.

Fix: Add .replace(':', '') at the end of line 62, with the full line being book_base_path = base_path / book['title'].replace(':', '')

Also, thank you @Susensio. Your solution fixed the other error I got!
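This fix can be written as a small helper. Extending it beyond : to the other characters Windows rejects in file names is an assumption on my part, not something the script or the comment does:

```python
def sanitize_title(title):
    # Remove characters Windows forbids in directory names.
    # The comment above only strips ':'; the rest of the set is an
    # assumed generalization.
    for ch in '<>:"/\\|?*':
        title = title.replace(ch, '')
    return title
```

Used as book_base_path = base_path / sanitize_title(book['title']).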


bjarnebuchmann commented Aug 6, 2020

Dear Santiago,
I would really like to get this to work, it seems like a great project with a pretty simple interface. However, I do run into problems, and the script carps with the following error:

[]$ python3 hb_download.py -d HumbleBundle_LinuxGeek --mobi HumbleBundle_LinuxGeek/bundle.html
Traceback (most recent call last):
  File "hb_download.py", line 100, in <module>
    pdf=args.pdf, epub=args.epub, mobi=args.mobi,
  File "hb_download.py", line 76, in download_books
    result, downloaded = download_file_from_url(book_base_path, url)
  File "hb_download.py", line 48, in download_file_from_url
    with requests.get(url, stream=True) as resp:
AttributeError: __exit__

Also, I got a warning about BeautifulSoup falling back to a default parser; it can be avoided by passing "lxml" as a second argument to the BeautifulSoup() call in line 10.

I know quite a bit about programming and Python, but I am not at all proficient in BeautifulSoup. For instance, I do not know exactly how your script deals with the login procedure to HumbleBundle.

I have introduced the changes suggested by @Susensio and @GhostofGoes, but that did not seem to change anything in the present case.

All help is most welcome.

EDIT:
Changing (around line 50) with requests.get(url, stream=True) as resp: to resp = requests.get(url, stream=True), and re-indenting the following lines, fixed the above problem for me. However, at this point all downloads are zero-size (empty) files. So subdirectories for books are created, and files too, but no content is actually downloaded.

I suspect that this has to do with login, but I am not sure how the login is performed with this py-script. Trying to use the URLs directly with e.g. wget just results in ERROR 403: Forbidden, so presumably the login cookie needs to be copied over from the Chrome session as well and passed along with the requests calls.

PS: If you figure out how to check the md5sums at some point, that would be a great addition too.
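On the checksum idea: a sketch of an MD5 helper using only the standard library, assuming you can obtain the expected checksum from the download page (md5_of_file is a hypothetical name, not part of the script):

```python
import hashlib

def md5_of_file(path, chunk_size=4 * 1024):
    # Hash the file in chunks so large books don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the result against the expected checksum would catch truncated or empty downloads like the ones described above.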

@hsantos78

> Thank you for making this awesome script! I ran into an issue when downloading Automate the Boring Stuff with Python: Practical Programming for Total Beginners from the "Linux Geeks" bundle on Windows 10. An exception was raised, NotADirectoryError: [WinError 267] The directory name is invalid: 'books\\Automate the Boring Stuff with Python: Practical Programming for Total Beginners'. The issue is the : in the path.
>
> Fix: Add .replace(':', '') at the end of line 62, with the full line being book_base_path = base_path / book['title'].replace(':', '')
>
> Also, thank you @Susensio. Your solution fixed the other error I got!

Hello, I just tried .replace(':', '') and was able to download the books, except it errors out on one book. I get the following message:


Downloaded:  C:\Python\HB\books\20200911M\Illustrated Guide to Home Chemistry Experiments\illustratedguidetohomechemistryexperiments.epub
Downloaded:  C:\Python\HB\books\20200911M\Atmospheric Monitoring with Arduino\atmosphericmonitoringwitharduino.epub
Traceback (most recent call last):
  File "hb_download.py", line 97, in <module>
    download_books(
  File "hb_download.py", line 75, in download_books
    result, downloaded = download_file_from_url(book_base_path, url)
  File "hb_download.py", line 49, in download_file_from_url
    with book_path.open('wb') as fp:
  File "C:\Python\Python38\lib\pathlib.py", line 1218, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "C:\Python\Python38\lib\pathlib.py", line 1074, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Python\\HB\\books\\20200911M\\Make Inventing a Better Mousetrap \\inventingabettermousetrap.epub'

I am currently using Win 10
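Note the trailing space in 'Make Inventing a Better Mousetrap ' in that traceback: Windows silently drops trailing spaces when creating a directory, so the path later used for the file no longer matches the directory that was created. A likely fix (clean_dir_name is a hypothetical helper, not in the script) is to strip the title before building the path:

```python
def clean_dir_name(title):
    # Windows drops trailing spaces from directory names, so strip
    # surrounding whitespace before using the title as a path component.
    return title.strip()
```

Used as book_base_path = base_path / clean_dir_name(book['title']), alongside the ':' replacement above.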
