Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Create a virtualenv (with virtualenvwrapper) and install bs4
$ mkvirtualenv --python=/usr/bin/python3 bs4test
(bs4test)$ pip install beautifulsoup4
(bs4test)$ pip install requests
The following script downloads a page, parses it into a BeautifulSoup object, and prints the parsed HTML.
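from bs4 import BeautifulSoup
import requests

def main():
    url = 'https://yourdomain.com'
    # Download the page and parse the response body with Python's built-in parser
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Printing the soup outputs the parsed document as HTML
    print(soup)

if __name__ == '__main__':
    main()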
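Extract a specific div by id or class
# find() returns the first match, or None if there is none
soup.find("div", {"id": "articlebody"})
# find_all() returns every match; findAll is the older spelling of the same method
soup.find_all("div", {"class": "stylelistrow"})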
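As a minimal, self-contained sketch of the two lookups above, the snippet below parses a small inline document instead of a live page. The HTML string is made up for illustration; only the id and class names are taken from the calls above. It also shows the keyword form (`id=`, `class_=`) and the CSS selector form, which should find the same elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<html><body>
  <div id="articlebody"><p>Main article text.</p></div>
  <div class="stylelistrow">First row</div>
  <div class="stylelistrow highlighted">Second row</div>
  <div class="other">Unrelated</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None if nothing matches.
article = soup.find("div", id="articlebody")
article_text = article.get_text(strip=True) if article else None

# class_ is used because "class" is a reserved word in Python.
# A div with several classes still matches if one of them is "stylelistrow".
rows = soup.find_all("div", class_="stylelistrow")
row_texts = [row.get_text(strip=True) for row in rows]

# The same lookups written as CSS selectors.
article_css = soup.select_one("div#articlebody")
rows_css = soup.select("div.stylelistrow")

print(article_text)
print(row_texts)
```

Checking the result of find() for None before using it avoids an AttributeError on pages where the element is missing.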
Extract tables
table = soup.find("table", {"title": "TheTitle"})
rows = []
# Collect every <tr> element inside the table
for row in table.find_all("tr"):
    rows.append(row)
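The loop above collects the row tags themselves. If the goal is the cell text, one possible next step is sketched below. The table content is invented for illustration, and the choice to treat header cells (th) and data cells (td) alike is an assumption, not something the original snippet specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this example.
html = """
<table title="TheTitle">
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"title": "TheTitle"})

data = []
if table is not None:  # find() returns None when no table matches
    for row in table.find_all("tr"):
        # Passing a list matches either tag name.
        cells = row.find_all(["th", "td"])
        data.append([cell.get_text(strip=True) for cell in cells])

print(data)
```

Note that find_all("tr") searches recursively, so rows of any nested table would be included as well.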
Find all images
images = []
for img in soup.find_all('img'):
    # get() returns None when the tag has no src attribute
    images.append(img.get('src'))
print(images)
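The src values collected above are often relative paths. A minimal sketch of turning them into absolute URLs follows, using urljoin from the standard library. The base URL path and the image markup are assumptions made up for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and markup, invented for this example.
base_url = "https://yourdomain.com/blog/"
html = """
<img src="/static/logo.png">
<img src="photo.jpg">
<img src="https://cdn.example.com/banner.gif">
<img alt="no source attribute">
"""

soup = BeautifulSoup(html, "html.parser")

# src=True keeps only <img> tags that actually have a src attribute.
image_urls = [
    urljoin(base_url, img["src"])
    for img in soup.find_all("img", src=True)
]

print(image_urls)
```

Filtering with src=True keeps None values out of the list, which the plain get('src') loop would otherwise include.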