@Denilson-Semedo
Last active June 8, 2023 01:29
BeautifulSoup Spreadsheet

Introduction to Beautiful Soup

Beautiful Soup is a popular Python library used for parsing HTML and XML documents. It provides a simple and intuitive interface to extract data from web pages or analyze structured markup data.

With Beautiful Soup, you can navigate and search through the parse tree, extract desired information, and even modify the content of the document. Whether you are scraping data from websites, parsing XML files, or working with HTML documents, Beautiful Soup can greatly simplify the process.

One of the key features of Beautiful Soup is its ability to handle poorly formatted or invalid markup. It is designed to be resilient and flexible, allowing you to work with imperfect HTML or XML and still extract the data you need.

For full details, see the official Beautiful Soup documentation.

Spreadsheet

| Topic | Description |
| --- | --- |
| Name | Beautiful Soup |
| Purpose | Python library for parsing HTML and XML documents |
| Features | Navigating parse trees; searching the parse tree; modifying the parse tree; extracting data from HTML/XML documents |
| Installation | `pip install beautifulsoup4` |
| Import Statement | `from bs4 import BeautifulSoup` |
| Initialization | `soup = BeautifulSoup(html_doc, 'html.parser')` |
| Basic Operations | Searching by tag names; searching by CSS class or ID; searching by attribute values |
| Extracting Data | Accessing tag contents; accessing tag attributes; extracting text from tags |
| Modifying Content | Modifying tag names; modifying tag attributes; modifying tag contents |
| Parsing Options | HTML parser options (`'html.parser'`, `'lxml'`, `'html5lib'`, etc.) |
| Error Handling | Beautiful Soup handles malformed HTML/XML gracefully and tries to make sense of it |
| Examples | Scraping website data; extracting information from XML files; parsing and manipulating HTML documents |
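
The import, initialization, and navigation entries above can be tried together in a minimal sketch. The sample `html_doc` and the variable names are made up for illustration; only the `BeautifulSoup` constructor and the built-in `'html.parser'` come from the table.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First <b>bold</b> paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access walks the tree and returns the first matching tag.
page_title = soup.title.string             # text inside <title>
first_paragraph = soup.p                   # first <p> in the document
parent_name = first_paragraph.parent.name  # name of the enclosing tag

print(page_title)
print(first_paragraph["class"])  # class is multi-valued, so this is a list
print(parent_name)
```

Attribute-style access such as `soup.p` only reaches the first match; the search methods in the next section return every match.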

Basic Operations:

| Operation | Command/Function | Example |
| --- | --- | --- |
| Searching by tag names | `find_all(name)` | `soup.find_all('a')` returns a list of all `<a>` tags in the parsed document |
| Searching by CSS class or ID | `find_all(attrs={'class': 'class_name'})` or `find_all(id='id_value')` | `soup.find_all(attrs={'class': 'title'})` returns tags with `class="title"` |
| Searching by attribute values | `find_all(attrs={'attribute': 'value'})` | `soup.find_all(attrs={'data-type': 'image'})` returns tags with `data-type="image"` |
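
The three search styles in the table can be compared side by side. This is a small sketch on an invented snippet; the markup, class names, and variable names are assumptions for illustration. It also shows the `class_` keyword and `find()`, which are part of the Beautiful Soup API but not listed in the table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html_doc = """
<div id="gallery">
  <h2 class="title">Photos</h2>
  <a href="/one.png" data-type="image">One</a>
  <a href="/two.png" data-type="image">Two</a>
  <a href="/about" class="title nav">About</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = soup.find_all("a")                            # by tag name
titles = soup.find_all(attrs={"class": "title"})      # by CSS class
titles_kw = soup.find_all(class_="title")             # same search via the class_ keyword
gallery = soup.find(id="gallery")                     # by ID; find() returns the first match or None
images = soup.find_all(attrs={"data-type": "image"})  # by arbitrary attribute value

print(len(links), [t.name for t in titles], gallery.name, len(images))
```

Searching for a single class also matches tags that carry several classes, which is why the `<a class="title nav">` link appears in `titles`. The `attrs` dictionary is needed for names such as `data-type` that are not valid Python keyword arguments.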

Extracting Data:

| Operation | Command/Function | Example |
| --- | --- | --- |
| Accessing tag contents | `.contents` or `.text` | `tag.contents` returns the tag's direct children as a list; `tag.text` returns the text inside the tag |
| Accessing tag attributes | `.get('attribute')` or `.attrs['attribute']` | `tag.get('href')` returns the value of the tag's `href` attribute, or `None` if the attribute is missing |
| Extracting text from tags | `.get_text()` or `.text` | `tag.get_text()` returns the combined text of the tag and its descendants |
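
A short sketch of the extraction calls above, run against an invented one-line document (the markup, URL, and variable names are placeholders for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html_doc = '<p class="intro">Read the <a href="https://example.com/docs">docs</a> first.</p>'
soup = BeautifulSoup(html_doc, "html.parser")

paragraph = soup.p
link = soup.a

children = paragraph.contents   # direct children: text node, <a> tag, text node
text = paragraph.get_text()     # text of the tag and all of its descendants
href = link.get("href")         # attribute value
missing = link.get("title")     # .get() returns None when the attribute is absent
same_href = link["href"]        # dictionary-style access; raises KeyError if absent

print(len(children), text, href, missing)
```

Prefer `.get()` when an attribute may be absent, since `tag['attribute']` and `tag.attrs['attribute']` raise `KeyError` in that case.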

Modifying Content:

| Operation | Command/Function | Example |
| --- | --- | --- |
| Modifying tag names | `.name = 'new_name'` | `tag.name = 'h2'` renames the tag to `<h2>` |
| Modifying tag attributes | `.attrs['attribute'] = 'new_value'` | `tag.attrs['class'] = 'highlight'` changes the value of the tag's `class` attribute |
| Modifying tag contents | `.string = 'new_content'` | `tag.string = 'Hello, World!'` replaces the tag's contents with the text 'Hello, World!' |
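
The three modifications above can be applied to one tag in sequence. This is a minimal sketch on an invented heading; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
soup = BeautifulSoup('<h1 class="plain">Old heading</h1>', "html.parser")
tag = soup.h1

tag.name = "h2"               # rename the tag
tag["class"] = "highlight"    # same effect as tag.attrs["class"] = "highlight"
tag.string = "Hello, World!"  # replace the tag's contents with a single string

result = str(soup)            # serialize the modified tree back to markup
print(result)                 # <h2 class="highlight">Hello, World!</h2>
```

Assigning to `.string` discards any existing children of the tag, so it is best suited to tags that hold only text.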

Parsing Options:

| Option | Command/Function | Example |
| --- | --- | --- |
| HTML parser options | Specify the parser when initializing Beautiful Soup | `soup = BeautifulSoup(html_doc, 'lxml')` initializes Beautiful Soup with the `'lxml'` parser |
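
`'lxml'` and `'html5lib'` are separate packages that have to be installed alongside Beautiful Soup, while `'html.parser'` ships with Python. One way to cope with that, sketched below, is to try the parsers in order of preference and fall back when one is missing. The `make_soup` helper is hypothetical and not part of the library; `FeatureNotFound` is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first installed parser in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, parser_used = make_soup("<p>Hello</p>")
print(parser_used, soup.p.get_text())
```

Different parsers can build different trees from the same markup, so pinning one parser explicitly tends to give more reproducible results than a fallback chain.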

Error Handling:

| Feature | Command/Function | Example |
| --- | --- | --- |
| Graceful error handling | Beautiful Soup handles malformed HTML/XML gracefully | `soup = BeautifulSoup(malformed_html, 'html.parser')` handles malformed HTML during parsing |
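
As a rough illustration of this tolerance, the sketch below parses a fragment with two unclosed tags and a stray closing tag. The input string is invented, and the repaired tree shown is what the built-in `'html.parser'` is expected to produce; other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Hypothetical broken markup: <p> and <b> are never closed, and </i> was never opened.
malformed_html = "<p>Unclosed paragraph <b>bold text</i>"

# Parsing does not raise; open tags are closed at the end of the document.
soup = BeautifulSoup(malformed_html, "html.parser")

bold_text = soup.b.get_text()
all_text = soup.get_text()

print(soup)       # the repaired markup
print(bold_text)
print(all_text)
```

Because the repair is a best guess, it is worth inspecting the resulting tree before relying on a selector against badly broken pages.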