@gersande
Last active August 11, 2020 02:04
A thing that uses Beautiful Soup
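#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = "http://www.gersande.com"
response = requests.get(url)

# Parse the HTML, then serialize it back to a string for the text search below.
# Naming the parser explicitly avoids Beautiful Soup's "no parser specified" warning.
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    """
    :param page: HTML of a web page, as a string
    :return: (first link found, index of its closing quote), or (None, 0) if no link is left
    """
    # Plain text search: only matches anchors whose first attribute is href.
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    # A missing quote would otherwise stall the loop below on the same text.
    if start_quote == -1 or end_quote == -1:
        return None, 0
    link = page[start_quote + 1: end_quote]
    return link, end_quote


while True:
    link, n = getURL(page)
    if link is None:
        break
    # An empty href is skipped rather than treated as the end of the page.
    if link:
        print(link)
    page = page[n:]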
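
The script above finds links by searching the page text for the literal string "a href", so it can miss anchors that put another attribute before href or that use single quotes. A minimal sketch of a parser-based alternative follows. It is not part of the gist: LinkCollector, extract_links and the sample HTML are hypothetical names made up for illustration, and it uses only the standard library's html.parser so that it runs without extra packages. With Beautiful Soup itself, soup.find_all("a", href=True) should select the same tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper, not in the gist)."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return every non-empty <a href> in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up sample input: href is not the first attribute, one link uses
# single quotes, and one anchor has no href at all.
sample = (
    '<p><a class="x" href="/about">About</a> '
    "<a href='http://example.com/'>Example</a> "
    '<a name="top">no link</a></p>'
)
print(extract_links(sample, "http://www.gersande.com"))
```

To run it against the live page, pass response.text and url from the script above to extract_links.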