@RylandDeGregory
Created April 1, 2024 14:31
Parse all text content from an HTML document using BeautifulSoup.
from bs4 import BeautifulSoup

# Path to the local file
doc = "C:\\Users\\Ryland\\Desktop\\index.html"

# Open the file in read mode
with open(doc, 'r') as f:
    content = f.read()

# Initialize the object with the document
soup = BeautifulSoup(content, "html.parser")

# Get the whole body tag
tag = soup.body

# Open a file for writing
with open("C:\\Users\\Ryland\\Desktop\\index.txt", "w") as file:
    # Write each string nested anywhere inside the body to the file
    for string in tag.strings:
        file.write(string + "\n")
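
One thing to be aware of: `tag.strings` also yields the whitespace-only strings that sit between tags, so the output file can end up with many blank lines. Depending on the Beautiful Soup version, the contents of `<script>` and `<style>` elements may show up as well. The following is a minimal sketch of a variant that avoids both, not part of the original gist. The function name `extract_text_lines` is hypothetical, and dropping `<script>`/`<style>` is an assumption about what counts as "text content".

```python
from bs4 import BeautifulSoup


def extract_text_lines(html):
    """Return the visible text strings of an HTML document, stripped of whitespace.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")

    # html.parser does not add a <body> to fragments, so fall back to the whole soup
    root = soup.body if soup.body is not None else soup

    # Assumption: script and style contents are not wanted in the output
    for element in root.find_all(["script", "style"]):
        element.decompose()

    # stripped_strings skips whitespace-only strings and strips the rest
    return list(root.stripped_strings)


if __name__ == "__main__":
    sample = "<html><body><h1> Title </h1>\n<p>Some <b>bold</b> text</p></body></html>"
    for line in extract_text_lines(sample):
        print(line)
```

The returned list can be written to a file the same way as above, for example with `file.write("\n".join(lines))`.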