human (Dec 17, 2024, 01:04 PM)
Starlette web app that provides an API where you pass in ?url=
and it strips all HTML tags and returns just the text, using beautifulsoup
assistant (Dec 17, 2024, 01:04 PM)
I'll create a simple Starlette web application that strips HTML tags from a provided URL using BeautifulSoup.
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "starlette",
#     "uvicorn",
#     "beautifulsoup4",
#     "httpx",
# ]
# ///
from starlette.applications import Starlette
from starlette.responses import PlainTextResponse, JSONResponse
from starlette.routing import Route
import httpx
from bs4 import BeautifulSoup
import uvicorn


async def strip_html(request):
    url = request.query_params.get('url')
    if not url:
        return JSONResponse(
            {"error": "Missing URL parameter. Use ?url=https://example.com"},
            status_code=400
        )
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()
            html = response.text
    except httpx.RequestError as e:
        return JSONResponse(
            {"error": f"Failed to fetch URL: {str(e)}"},
            status_code=400
        )
    except httpx.HTTPStatusError as e:
        return JSONResponse(
            {"error": f"HTTP error occurred: {str(e)}"},
            status_code=e.response.status_code
        )
    # Parse HTML and extract text
    soup = BeautifulSoup(html, 'html.parser')
    # Remove script and style elements
    for element in soup(['script', 'style']):
        element.decompose()
    # Get text and clean up whitespace
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    text = '\n'.join(line for line in lines if line)
    return PlainTextResponse(text)


app = Starlette(debug=True, routes=[
    Route('/', strip_html),
])

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=8000)
I've created a Starlette web application that does the following:
- Takes a URL parameter via the query string (?url=)
- Fetches the content of that URL asynchronously using httpx
- Uses BeautifulSoup to:
  - Remove all script and style tags
  - Extract clean text content
  - Clean up whitespace and empty lines
- Returns the plain text result
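The BeautifulSoup steps above can be tried on their own, without starting the web server. This is a minimal sketch that assumes beautifulsoup4 is installed; the function name extract_text and the sample HTML are made up for illustration and are not part of the script.

```python
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Strip tags from an HTML string, mirroring the handler's parsing steps."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not leak into the text
    for element in soup(["script", "style"]):
        element.decompose()
    # Trim each line and discard the empty ones
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


# Illustrative input only: a heading, a paragraph, and tags that should vanish
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>Title</h1>\n<p>Hello <b>world</b></p>"
    "<script>alert(1)</script></body></html>"
)
print(extract_text(sample))
```

The style rule and the script body are removed along with their tags, so only the heading and paragraph text remain.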
To use it:
- Save the file (e.g., as text_extractor.py)
- Run it: python text_extractor.py
- Access it at: http://localhost:8000/?url=https://example.com
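If the target URL carries its own query string, it should be percent-encoded before being passed as ?url=, otherwise its extra parameters are read as parameters of this API rather than as part of the target. A small sketch using only the standard library; the helper name build_request_url and the default base address are illustrative assumptions.

```python
from urllib.parse import urlencode


def build_request_url(target: str, base: str = "http://localhost:8000/") -> str:
    """Return the API URL for `target`, percent-encoding it as the url parameter."""
    # urlencode escapes ':', '/', '?', '=' and '&' so the target survives intact
    return f"{base}?{urlencode({'url': target})}"


print(build_request_url("https://example.com/search?q=html&page=2"))
```

Starlette decodes the parameter again on the way in, so the handler still receives the original target URL.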
The app includes error handling for:
- Missing URL parameter
- Failed URL fetches
- Invalid HTTP responses
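The dependencies (starlette, uvicorn, beautifulsoup4, and httpx) are listed in the script header as inline script metadata. A script runner that understands that header (for example, uv run text_extractor.py) will install them automatically; with plain python they need to be installed first.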
Would you like me to add any additional features or modify the error handling?
I created this in a project with these instructions: