@Sieboldianus
Last active May 2, 2023 06:18
Automate getting monthly PDFs from websites, protected by (simple) credentials, with Selenium and Chromedriver

I have a number of monthly manual tasks that I have not been able to automate so far.

One of them is getting PDFs from login-protected websites, saving them in specific folders following naming conventions (renaming etc.), and uploading them to my Nextcloud.

The script below is for my electricity provider's PDFs. They sit behind a simple login form (user & password), and only the past 6 months are available. I always forget to check regularly enough to download them all.

Two key takeaways:

  1. Use the selenium/standalone-chrome Docker image
  • Mount the host download folder directly onto the container's default download folder (/home/seluser/Downloads, owned by seluser), so Chrome's default PDF download location does not have to be changed
docker run -d -p 127.0.0.1:4448:4444 \
    --volume "$(pwd)/download":/home/seluser/Downloads/ \
    selenium/standalone-chrome
  2. Use wait times after Selenium's .get() so that the session/cookies can be set
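
Because docker run -d returns before Chrome is ready to accept sessions, it can help to poll the server before connecting. The following is a minimal sketch, not part of the original script: it assumes the container exposes the standard WebDriver /status endpoint on the port mapped above (4448), and the helper names grid_is_ready and wait_for_grid are made up for illustration.

```python
import json
import time
import urllib.request


def grid_is_ready(status_url, timeout=2.0):
    """Return True if the WebDriver /status endpoint reports ready."""
    try:
        with urllib.request.urlopen(status_url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body: not ready yet.
        return False
    value = payload.get("value", {}) if isinstance(payload, dict) else {}
    return bool(value.get("ready"))


def wait_for_grid(status_url, attempts=30, delay=1.0, probe=grid_is_ready):
    """Poll the status endpoint; return False if it never becomes ready."""
    for _ in range(attempts):
        if probe(status_url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_for_grid("http://127.0.0.1:4448/status") before webdriver.Remote(...) would then replace a fixed sleep after starting the container. The probe parameter only exists so the polling loop can be exercised without a running container.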

This is my code. I think it can be adapted to comparably simple login pages by replacing the find_element parts:

import sys
import logging
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from random import randint
from time import sleep

from pathlib import Path


def list_downloads(driver):
    if not driver.current_url.startswith("chrome://downloads"):
        driver.get("chrome://downloads/")
    return driver.execute_script("""
        var items = document.querySelector('downloads-manager')
            .shadowRoot.getElementById('downloadsList').items;
        if (items.every(e => e.state === "COMPLETE"))
            return items.map(e => e.fileUrl || e.file_url);
        """)


# enable debug logging
root = logging.getLogger()
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter(
    '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)

logging.info('Creating folder.')
out_dir = Path.cwd() / "download"
out_dir.mkdir(exist_ok=True)

prefs = {'profile.default_content_settings.popups': 0,
         'download.prompt_for_download': False,
         'download.directory_upgrade': True,
         'plugins.always_open_pdf_externally': True}

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', prefs)
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
# see https://stackoverflow.com/a/73840130/4556479
options.add_argument("--headless=new")
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')


logging.info('Connecting to Remote Chrome')
driver = webdriver.Remote(
    "http://127.0.0.1:4448/wd/hub", options=options)

logging.info('Opening login page')
# Implicit wait: how long find_element() keeps polling for an element
driver.implicitly_wait(5)
driver.get('https://sample.com/login.php')

# Sleep a random number of seconds
sleep(randint(2, 5))

# Click 'Accept cookies' button
logging.info('Accepting cookies..')
accept_cookies_button = driver.find_element(By.ID, 'cookieNotifyButton')
accept_cookies_button.click()

sleep(randint(2, 5))

logging.info('Sending username and password.')
username_input = driver.find_element(By.CSS_SELECTOR, 'input[name="pin"]')
password_input = driver.find_element(By.CSS_SELECTOR, 'input[name="passwd"]')
username_input.send_keys("xyz")
password_input.send_keys("xyz")

sleep(randint(2, 5))
logging.info('Logging in..')
login_button = driver.find_element(By.NAME, 'login0')
login_button.click()

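# WebDriverWait.until() raises TimeoutException (not NoSuchElementException)
# when the element does not appear in time.
from selenium.common.exceptions import TimeoutException

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'lh2'))
    )
except TimeoutException:
    logging.error('Element "lh2" not found after login; aborting.')
    driver.quit()
    sys.exit(1)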

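# this seems necessary for the PDF to load
# ('Abruf Rechnungsdaten' is the site's German link text: "Retrieve invoice data")
# find_element_by_link_text() was removed in Selenium 4.3; use By.LINK_TEXT
driver.find_element(By.LINK_TEXT, 'Abruf Rechnungsdaten').click()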

logging.info('Retrieving PDF')
driver.get('https://sample.com/get_pdf.php?value=-5')

# wait for download to finish
paths = WebDriverWait(driver, 12, 3).until(list_downloads)
print(paths)

driver.quit()
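
The script stops at printing the file URLs that Chrome reports from inside the container. As a rough sketch of the renaming step mentioned at the top, the helpers below map such a URL to the mounted host folder and move the file to a dated name. None of this is from the original script: the function names, the "YYYY-MM_label" naming scheme, and the example folder names are assumptions for illustration.

```python
import shutil
from datetime import date
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def host_path(file_url, host_dir):
    """Map a file:// URL from inside the container to the mounted host folder."""
    # Only the file name is shared between container and host paths.
    name = PurePosixPath(unquote(urlparse(file_url).path)).name
    return Path(host_dir) / name


def archive_pdf(src, target_dir, label, month):
    """Move src into target_dir as 'YYYY-MM_label.<ext>' and return the new path."""
    src = Path(src)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / f"{month:%Y-%m}_{label}{src.suffix}"
    if dest.exists():
        # Refuse to overwrite an invoice that was archived earlier.
        raise FileExistsError(dest)
    shutil.move(str(src), str(dest))
    return dest
```

With the script above, this could be used as archive_pdf(host_path(paths[0], out_dir), Path.home() / "invoices", "electricity", date.today()), where the target folder and label are placeholders. shutil.move is used instead of Path.rename so the move also works when the archive folder is on a different filesystem.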

