Beautiful Soup is a popular Python library used for parsing HTML and XML documents. It provides a simple and intuitive interface to extract data from web pages or analyze structured markup data.
With Beautiful Soup, you can navigate and search through the parse tree, extract desired information, and even modify the content of the document. Whether you are scraping data from websites, parsing XML files, or working with HTML documents, Beautiful Soup can greatly simplify the process.
One of the key features of Beautiful Soup is its ability to handle poorly formatted or invalid markup. It is designed to be resilient and flexible, allowing you to work with imperfect HTML or XML and still extract the data you need.
For full details, see the official Beautiful Soup documentation.

Topic | Description |
---|---|
Name | Beautiful Soup |
Purpose | Python library for parsing HTML and XML documents |
Features | Navigating the parse tree; searching the parse tree; modifying the parse tree; extracting data from HTML/XML documents |
Installation | `pip install beautifulsoup4` |
Import Statement | `from bs4 import BeautifulSoup` |
Initialization | `soup = BeautifulSoup(html_doc, 'html.parser')` |
Basic Operations | Searching by tag names; searching by CSS class or ID; searching by attribute values |
Extracting Data | Accessing tag contents; accessing tag attributes; extracting text from tags |
Modifying Content | Modifying tag names; modifying tag attributes; modifying tag contents |
Parsing Options | Choice of parser (`'html.parser'`, `'lxml'`, `'html5lib'`, etc.) |
Error Handling | Beautiful Soup handles malformed HTML/XML gracefully and tries to make sense of it |
Examples | Scraping website data; extracting information from XML files; parsing and manipulating HTML documents |
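
To make the import and initialization rows above concrete, here is a minimal sketch. The sample markup in `html_doc` is invented for illustration, and the sketch assumes `beautifulsoup4` is already installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main">Welcome</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string           # text inside <title>
heading = soup.find("h1")                # first <h1> tag, or None if absent
intro = soup.find("p", class_="intro")   # first <p> with class="intro"

print(page_title)        # Sample page
print(heading["id"])     # main
print(intro.get_text())  # First paragraph.
```

Note that `find` returns `None` when nothing matches, so real scraping code usually checks the result before accessing attributes on it.
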

Operation | Command/Function | Example |
---|---|---|
Searching by tag names | `find_all(name)` | `soup.find_all('a')` returns a list of all `<a>` tags in the parsed document |
Searching by CSS class or ID | `find_all(attrs={'class': 'class_name'})` or `find_all(id='id_value')` | `soup.find_all(attrs={'class': 'title'})` returns tags with `class="title"` |
Searching by attribute values | `find_all(attrs={'attribute': 'value'})` | `soup.find_all(attrs={'data-type': 'image'})` returns tags with `data-type="image"` |
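
The three search styles in the table can be tried together on a small document. The markup, class names, and attribute values below are assumptions made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class, id, and data-type values are illustrative.
html_doc = """
<div>
  <a href="/home" class="nav">Home</a>
  <a href="/about" class="nav">About</a>
  <h2 class="title" id="intro">Introduction</h2>
  <img src="cat.png" data-type="image">
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a")                            # search by tag name
titles = soup.find_all(attrs={"class": "title"})      # search by CSS class
by_id = soup.find_all(id="intro")                     # search by ID
images = soup.find_all(attrs={"data-type": "image"})  # search by attribute value

print(len(links))              # 2
print(titles[0].get_text())    # Introduction
print(by_id[0].name)           # h2
print(images[0]["src"])        # cat.png
```

`find_all` always returns a list-like result, which is empty when nothing matches, so it is safe to loop over without a prior check.
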

Operation | Command/Function | Example |
---|---|---|
Accessing tag contents | `.contents` or `.text` | `tag.contents` returns the tag's direct children; `tag.text` returns the text inside the tag |
Accessing tag attributes | `.get('attribute')` or `.attrs['attribute']` | `tag.get('href')` returns the value of the tag's `href` attribute |
Extracting text from tags | `.get_text()` or `.text` | `tag.get_text()` returns the combined text of the tag and its descendants |
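
The sketch below runs the extraction operations from the table on one paragraph. The markup and the `example.com` URL are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document used only for this example.
html_doc = '<p class="note">Read the <a href="https://example.com/docs">docs</a> first.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

paragraph = soup.p
link = soup.a

children = paragraph.contents       # direct children: text, <a> tag, text
href = link.get("href")             # attribute value, or None if missing
missing = link.get("title")         # None, because <a> has no title attribute
classes = paragraph.attrs["class"]  # a list, since class can hold several values
text = paragraph.get_text()         # text of the tag and all its descendants

print(len(children))  # 3
print(href)           # https://example.com/docs
print(missing)        # None
print(classes)        # ['note']
print(text)           # Read the docs first.
```

Using `.get('attribute')` rather than `.attrs['attribute']` avoids a `KeyError` when the attribute may be absent.
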

Operation | Command/Function | Example |
---|---|---|
Modifying tag names | `.name = 'new_name'` | `tag.name = 'h2'` changes the tag's name to `<h2>` |
Modifying tag attributes | `.attrs['attribute'] = 'new_value'` | `tag.attrs['class'] = 'highlight'` changes the value of the tag's `class` attribute |
Modifying tag contents | `.string = 'new_content'` | `tag.string = 'Hello, World!'` changes the tag's text content to 'Hello, World!' |
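
As a rough illustration of the three modifications in the table, the following sketch edits a single tag in place and then serializes the result. The starting markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical starting markup.
soup = BeautifulSoup('<h1 class="plain">Old title</h1>', "html.parser")
tag = soup.h1

tag.name = "h2"                    # rename the tag
tag.attrs["class"] = "highlight"   # replace the class attribute
tag.string = "Hello, World!"       # replace the tag's contents with new text

result = str(soup)
print(result)  # <h2 class="highlight">Hello, World!</h2>
```

The changes apply to the parse tree itself, so converting `soup` back to a string reflects every edit made through `tag`.
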

Option | Command/Function | Example |
---|---|---|
HTML parser options | Specify the parser when initializing Beautiful Soup | `soup = BeautifulSoup(html_doc, 'lxml')` initializes Beautiful Soup with the `'lxml'` parser |
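
The `'lxml'` and `'html5lib'` parsers are separate packages that must be installed alongside Beautiful Soup, while `'html.parser'` comes with Python. One possible way to cope with that, sketched below, is to try a preferred parser and fall back to the built-in one. The helper name `make_soup` is hypothetical and not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html_doc = "<p>Hello</p>"


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser if it is
    installed, otherwise fall back to Python's built-in html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html_doc)
print(soup.p.get_text())  # Hello
```

Different parsers can build slightly different trees from the same markup (for example, `lxml` wraps a fragment in `<html>` and `<body>`), so it is usually best to settle on one parser per project.
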

Feature | Command/Function | Example |
---|---|---|
Graceful error handling | Beautiful Soup handles malformed HTML/XML gracefully | `soup = BeautifulSoup(html_malformado, 'html.parser')` handles malformed HTML during parsing |
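
To see the error tolerance in practice, the sketch below parses markup with several unclosed tags. The input string is made up for illustration, and the exact shape of the repaired tree depends on which parser is used, so the example only checks properties that should hold with `'html.parser'`.

```python
from bs4 import BeautifulSoup

# Hypothetical malformed input: the <p> and <b> tags are never closed.
malformed_html = "<div><p>First paragraph<p><b>Bold text</div>"

# Parsing does not raise; the missing end tags are supplied on output.
soup = BeautifulSoup(malformed_html, "html.parser")

paragraphs = soup.find_all("p")
bold_text = soup.b.get_text()
repaired = str(soup)

print(len(paragraphs))      # 2
print(bold_text)            # Bold text
print("</b>" in repaired)   # True
```

Because the data is still reachable through the usual search methods, the same extraction code works on both clean and imperfect markup.
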