@hepplerj
Created March 30, 2023 15:07
Simple Python scraper for Internet Archive full-text pages
#!/usr/bin/env python3
"""Scrape the plain text of an Internet Archive full-text page into a file."""
import requests
from bs4 import BeautifulSoup

url = "https://archive.org/stream/ahandbooksocial00blisgoog/ahandbooksocial00blisgoog_djvu.txt"

# Fetch the page; fail fast on network stalls and HTTP errors.
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# The selector targets the <pre> element expected to hold the full text.
pre_selector = "#maincontent > div > pre"
pre_element = soup.select_one(pre_selector)

if pre_element:
    text = pre_element.text.strip()
    with open("scraped.txt", "w", encoding="utf-8") as f:
        f.write(text)
    print("Text written to file.")
else:
    print(f"Could not find element with selector {pre_selector}")
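
The script above depends on BeautifulSoup and on one specific CSS selector. As a rough alternative, the sketch below pulls the text of the first <pre> element using only the standard library's html.parser. It is not the gist author's method: the names PreTextExtractor and extract_pre_text are hypothetical, and it assumes the first <pre> on the page is the one holding the book text, so it ignores the "#maincontent > div" part of the selector.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text inside the first <pre> element of an HTML document.

    Hypothetical helper for illustration; not part of the original gist.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = False   # True once a <pre> start tag has been seen
        self._depth = 0      # nesting depth inside the first <pre>
        self._done = False   # True once the first <pre> has closed
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self.found = True
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        # Only keep text that appears while we are inside the first <pre>.
        if self._depth > 0:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks).strip()


def extract_pre_text(html):
    """Return the stripped text of the first <pre>, or None if there is none."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text if parser.found else None
```

Because extract_pre_text takes an HTML string rather than a URL, it can be checked against a saved page or a small inline snippet without touching the network; in the script above it would be called on response.text.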