@sidharthshah
Created August 7, 2019 10:10
Python Crawling Example
import re
import ssl
from urllib import request
seedlist = ['https://scrapy.org/']
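def extract_urls(url):
    """
    Fetch the page at `url` and return every absolute http(s) URL found in its HTML.
    """
    results = []
    # NOTE: an unverified SSL context skips certificate checks; acceptable for a demo only.
    with request.urlopen(url, context=ssl._create_unverified_context()) as response:
        # Decode the bytes; str(bytes) would return the "b'...'" repr with escaped newlines.
        html = response.read().decode("utf-8", errors="replace")
        # The `$-_` character range is broad, so matches may include trailing quotes or angle brackets.
        for candidate in re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", html):
            results.append(candidate)
    return results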
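visited = set()  # URLs already fetched, so the same page is never crawled twice
while seedlist:
    url = seedlist.pop()  # pop() takes from the end, so the crawl is depth-first
    if url in visited:
        continue
    visited.add(url)
    extracted_links = extract_urls(url)
    print(extracted_links)
    seedlist.extend(extracted_links)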
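
The loop above keeps following links for as long as it finds new ones and stops at the first network error. The following is a minimal sketch of a bounded, breadth-first variant; it is not part of the original gist. The names `LinkParser`, `extract_links`, `crawl`, `fetch`, and `max_pages` are hypothetical and introduced only for illustration. It assumes links live in `<a href>` attributes and uses the standard library's `html.parser` instead of the regex, so relative links can be resolved with `urljoin`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return unique absolute http(s) links in `html`, resolved against `base_url`."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


def crawl(seeds, fetch, max_pages=10):
    """Breadth-first crawl. `fetch(url)` is assumed to return the page HTML as str."""
    queue = deque(seeds)
    seen = set()
    visited = []  # URLs in the order they were fetched
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        visited.append(url)
        try:
            html = fetch(url)
        except OSError:
            # urllib.error.URLError is a subclass of OSError; skip pages that fail.
            continue
        queue.extend(extract_links(url, html))
    return visited
```

To run this against live pages, `fetch` could be a small function that wraps `request.urlopen` and decodes the response body. Passing the fetcher in as a parameter also lets the crawl logic be exercised with an in-memory dictionary of pages and no network access.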