Scraping a page with a headless browser in Python: Selenium WebDriver + PhantomJS

Install dependencies in the bash shell

# webdriver.PhantomJS was removed in Selenium 4, so pin to the 3.x series
pip3 install -U "selenium<4"

# macOS
brew install phantomjs

# GNU/Linux (Debian/Ubuntu; run as root or prefix each command with sudo)
apt-get install -y build-essential chrpath libssl-dev libxft-dev libfreetype6-dev libfreetype6 libfontconfig1-dev libfontconfig1
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
tar xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 -C /usr/local/share/
ln -s /usr/local/share/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin/
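Before moving on, it can help to confirm that the phantomjs binary is actually reachable on your PATH. The following is a minimal sketch using only the standard library; the helper name phantomjs_version is hypothetical and not part of Selenium or PhantomJS.

```python
import shutil
import subprocess


def phantomjs_version(binary="phantomjs"):
    """Return the PhantomJS version string, or None if the binary is not on PATH.

    `phantomjs_version` is a hypothetical helper for illustration only.
    """
    path = shutil.which(binary)  # look the executable up on PATH
    if path is None:
        return None
    # `phantomjs --version` prints the version number and exits
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None


if __name__ == "__main__":
    print(phantomjs_version() or "phantomjs not found on PATH")
```

If this prints a version number, Selenium should be able to launch the binary without an explicit executable path.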

Download a JavaScript-rendered page in Python

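from selenium import webdriver

# Requires Selenium 3.x; webdriver.PhantomJS does not exist in Selenium 4
driver = webdriver.PhantomJS()

def js_to_html(url):
    """Load url in the headless browser and return the rendered HTML."""
    driver.get(url)
    return driver.page_source

print(js_to_html('https://www.washingtonpost.com/'))
driver.quit()  # shut down the PhantomJS process when finished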
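Once you have the rendered HTML as a string, you will usually want to pull something out of it. As one possible next step, here is a rough sketch that collects link targets with the standard library's html.parser. The names LinkCollector and extract_links are hypothetical, and the sample string stands in for real page source.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for the string returned by js_to_html(url)
sample = (
    '<html><body><a href="/news">News</a>'
    '<a name="x">no href</a>'
    '<a href="https://example.com/">Example</a></body></html>'
)
print(extract_links(sample))  # → ['/news', 'https://example.com/']
```

With a live driver, the same function could be applied as extract_links(js_to_html(url)). For heavier scraping a dedicated parser library may be more convenient, but the standard library is enough for simple cases.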