@dunderrrrrr
Created February 21, 2020 13:33
Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Create a virtualenv and install bs4 and requests (mkvirtualenv is provided by virtualenvwrapper)

$ mkvirtualenv --python=/usr/bin/python3 bs4test
(bs4test)$ pip install beautifulsoup4
(bs4test)$ pip install requests

The following fetches a page, parses it into a BeautifulSoup object, and prints the entire HTML.

from bs4 import BeautifulSoup
import requests

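def main():
    url = 'https://yourdomain.com'  # placeholder; replace with the page to fetch
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup)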

if __name__ == '__main__':
    main()
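
To experiment without a network request, the same constructor accepts a literal HTML string. The sketch below is only an illustration of navigating and modifying the parse tree; the markup and the variable names are made up for the example and stand in for `page.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for page.text from the script above.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello <b>soup</b></p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute access returns the first tag with that name.
title_text = soup.title.string

# get_text() flattens nested tags into a single string.
intro_text = soup.p.get_text()

# Modify: tag attributes behave like dictionary entries.
soup.p["class"] = "lead"

print(title_text)
print(intro_text)
print(soup.p)
```

Attribute access such as `soup.p` only reaches the first matching tag; use `find_all` when there may be several.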

Extract specific div by id or class

soup.find("div", {"id": "articlebody"})            # first match, or None
soup.find_all("div", {"class": "stylelistrow"})    # list of all matches
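
A small self-contained sketch of the two lookups, assuming made-up markup that reuses the `articlebody` id and `stylelistrow` class from the snippet above. It uses the keyword form (`id=`, `class_=`) as an alternative to the attribute dictionary, and checks for `None`, which is what `find` returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id and class names come from the snippet above.
html = """
<div id="articlebody">Main text</div>
<div class="stylelistrow">Row 1</div>
<div class="stylelistrow other">Row 2</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if there is no match.
body = soup.find("div", id="articlebody")
body_text = body.get_text(strip=True) if body is not None else None

# class_ (with trailing underscore) avoids the Python keyword "class".
# A tag matches if any one of its classes equals the given name.
row_texts = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="stylelistrow")
]

missing = soup.find("div", id="does-not-exist")

print(body_text)
print(row_texts)
print(missing)
```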

Extract tables

table = soup.find("table", {"title": "TheTitle"})
rows = []
for row in table.find_all("tr"):
    rows.append(row)
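
The loop above collects `<tr>` tags as they are. As a possible next step, the sketch below turns each row into a list of cell strings. The table contents and the `data` name are invented for the example; only the `TheTitle` lookup mirrors the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical table; real pages will differ in structure.
html = """
<table title="TheTitle">
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apple</td><td>3</td></tr>
  <tr><td>Pear</td><td>5</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"title": "TheTitle"})

data = []
if table is not None:  # find() returns None when the table is absent
    for row in table.find_all("tr"):
        # Passing a list matches either header or data cells.
        cells = row.find_all(["th", "td"])
        data.append([cell.get_text(strip=True) for cell in cells])

for line in data:
    print(line)
```

Cell values come back as strings, so numeric columns still need converting, for example with `int()`.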

Find all images

images = []
for img in soup.find_all('img'):
    images.append(img.get('src'))
print(images)
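
The `src` values collected above can be relative paths, and `img.get('src')` yields `None` for tags without a `src`. One way to handle both, sketched here under the assumption that the page was fetched from a known base URL, is to filter with `src=True` and resolve each value with `urljoin` from the standard library. The `base_url` path and the markup are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL of the fetched page (hypothetical path on the placeholder domain).
base_url = "https://yourdomain.com/blog/"

# Hypothetical markup with root-relative, relative, absolute, and missing src.
html = """
<img src="/static/logo.png">
<img src="photo.jpg">
<img src="https://cdn.example.com/banner.png">
<img alt="no source">
"""
soup = BeautifulSoup(html, "html.parser")

# src=True keeps only <img> tags that actually carry a src attribute.
image_urls = [
    urljoin(base_url, img["src"])
    for img in soup.find_all("img", src=True)
]

for image_url in image_urls:
    print(image_url)
```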