@simonw
Last active December 22, 2024 02:52

HTML Tag Stripper API with Starlette and BeautifulSoup

human (Dec 17, 2024, 01:04 PM)

Starlette web app that provides an API where you pass in ?url= and it strips all HTML tags and returns just the text, using beautifulsoup

assistant (Dec 17, 2024, 01:04 PM)

I'll create a simple Starlette web application that strips HTML tags from a provided URL using BeautifulSoup.

create HTML Text Extractor API

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "starlette",
#     "uvicorn",
#     "beautifulsoup4",
#     "httpx",
# ]
# ///

from starlette.applications import Starlette
from starlette.responses import PlainTextResponse, JSONResponse
from starlette.routing import Route
import httpx
from bs4 import BeautifulSoup
import uvicorn

async def strip_html(request):
    url = request.query_params.get('url')
    
    if not url:
        return JSONResponse(
            {"error": "Missing URL parameter. Use ?url=https://example.com"},
            status_code=400
        )
    
    try:
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()
            html = response.text
    except httpx.RequestError as e:
        return JSONResponse(
            {"error": f"Failed to fetch URL: {str(e)}"},
            status_code=400
        )
    except httpx.HTTPStatusError as e:
        return JSONResponse(
            {"error": f"HTTP error occurred: {str(e)}"},
            status_code=e.response.status_code
        )
    
    # Parse HTML and extract text
    soup = BeautifulSoup(html, 'html.parser')
    
    # Remove script and style elements
    for element in soup(['script', 'style']):
        element.decompose()
    
    # Get text and clean up whitespace
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    text = '\n'.join(line for line in lines if line)
    
    return PlainTextResponse(text)

app = Starlette(debug=True, routes=[
    Route('/', strip_html),
])

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=8000)

I've created a Starlette web application that does the following:

  1. Takes a URL parameter via the query string (?url=)
  2. Fetches the content of that URL asynchronously using httpx
  3. Uses BeautifulSoup to:
    • Remove all script and style tags
    • Extract clean text content
    • Clean up whitespace and empty lines
  4. Returns the plain text result

To use it:

  1. Save the file (e.g., as text_extractor.py)
  2. Run it with a PEP 723-aware runner such as uv (uv run text_extractor.py), or with python text_extractor.py if the dependencies are already installed
  3. Access it at: http://localhost:8000/?url=https://example.com (an example request follows below)
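Once the server is up, any HTTP client can hit the endpoint. Here is a minimal client sketch (not part of the original gist; it assumes the server is running on the default localhost:8000 address):

```python
# Minimal client sketch: fetch the stripped text of a page through the API.
# Assumes the server above is running on localhost:8000.
import httpx

response = httpx.get(
    "http://localhost:8000/",
    params={"url": "https://example.com"},
    timeout=30,
)
response.raise_for_status()
print(response.text)  # plain text with scripts, styles, and tags removed
```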

The app includes error handling (exercised in the sketch below) for:

  • Missing URL parameter
  • Failed URL fetches
  • Invalid HTTP responses
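A quick, hypothetical session showing those error paths (httpbin.org is used here only as a stand-in for a URL that reliably returns an HTTP error):

```python
# Hedged sketch of the two failure modes; assumes the server is on localhost:8000.
import httpx

# 1. Missing ?url= parameter -> 400 with a JSON error body
r = httpx.get("http://localhost:8000/")
print(r.status_code)  # 400
print(r.json())       # {"error": "Missing URL parameter. Use ?url=https://example.com"}

# 2. Upstream HTTP error -> mirrored status code and a JSON error body
r = httpx.get(
    "http://localhost:8000/",
    params={"url": "https://httpbin.org/status/404"},
)
print(r.status_code)  # 404, passed through from the upstream response
```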

The dependencies (starlette, uvicorn, beautifulsoup4, and httpx) are listed in the inline script metadata header (PEP 723) and will be installed automatically by a compatible script runner such as uv.

Would you like me to add any additional features or modify the error handling?


simonw commented Dec 17, 2024

I created this in a project with these instructions:

You write Python tools as single files. They always start with this comment:
```python
# /// script
# requires-python = ">=3.12"
# ///
```
These files can include dependencies on libraries such as Click. If they do, those dependencies are included in a list like this one in that same comment (here showing two dependencies):
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "click",
#     "sqlite-utils",
# ]
# ///
```


simonw commented Dec 17, 2024

Project as JSON:

{
    "uuid": "4310fada-3217-46b7-8ab0-4877bd8e0ae5",
    "name": "Python app",
    "description": "",
    "is_private": true,
    "creator": {
        "uuid": "2a0434bc-651f-407a-bc74-bdc295dc009b",
        "full_name": "Simon Willison"
    },
    "is_starred": false,
    "is_starter_project": false,
    "is_harmony_project": false,
    "archiver": null,
    "archived_at": null,
    "created_at": "2024-12-17T21:01:57.035217Z",
    "updated_at": "2024-12-17T21:03:45.359723Z",
    "prompt_template": "You write Python tools as single files. They always start with this comment:\n```python\n# /// script\n# requires-python = \">=3.12\"\n# ///\n```\nThese files can include dependencies on libraries such as Click. If they do, those dependencies are included in a list like this one in that same comment (here showing two dependencies):\n```python\n# /// script\n# requires-python = \">=3.12\"\n# dependencies = [\n#     \"click\",\n#     \"sqlite-utils\",\n# ]\n# ///",
    "docs_count": 0
}

@christopherwoodall

Really enjoyed the write-up. I tried to implement it as a sort of "code interpreter" here. Works well when the model cooperates.
