@duggalrahul
Created September 13, 2013 09:41
This is a basic Python web crawler. I use a depth-limited recursive search to go through web pages, and a simple regular expression to extract http and https hyperlinks from the source code of each page. Built in Python 2.7.5. To crawl a different site, change the start_link variable to the URL of the page where you want to begin crawling.
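import re
import urllib2

# Page where the crawl begins; change this to crawl a different site.
start_link = 'http://precog.iiitd.edu.in/'
# Every URL discovered so far.
urls = set([start_link])

def findId(source):
    # Return all double-quoted http/https URLs found in the page source.
    return re.findall(r'"(https?://[^"\s]+)"', source)

def get_source(url):
    # Download the page and return its raw source.
    response = urllib2.urlopen(url)
    page_source = response.read()
    return page_source

def search(source, depth):
    # Stop once the depth limit is reached.
    if depth == 2:
        return
    print source, depth
    try:
        page_source = get_source(source)
        links = set(findId(page_source))
    except Exception:
        print 'some error encountered'
        return
    global urls
    # Recurse only into links not seen before, so no page is fetched twice.
    new_links = links - urls
    urls = urls | new_links
    for link in new_links:
        search(link, depth + 1)

search(start_link, 0)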
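
The search function above recurses into each link as soon as it is found, so pages are visited depth-first. For comparison, here is a minimal sketch of a breadth-first variant that uses a queue to visit pages level by level. It is not part of the original script: it assumes Python 3, and the names extract_links, crawl_bfs, fetch and max_depth are hypothetical. The fetch argument is any callable that maps a URL to its page source, so the traversal can be tried without network access.

```python
import re
from collections import deque

# Same idea as findId: double-quoted http/https URLs in the page source.
LINK_RE = re.compile(r'"(https?://[^"\s]+)"')


def extract_links(page_source):
    """Return the double-quoted http/https URLs found in page_source."""
    return LINK_RE.findall(page_source)


def crawl_bfs(start_link, fetch, max_depth=2):
    """Visit pages level by level and return (url, depth) in visit order.

    fetch is a hypothetical callable that takes a URL and returns its
    page source as a string. Pages at depth >= max_depth are not fetched.
    """
    seen = {start_link}                # URLs already queued or visited
    queue = deque([(start_link, 0)])   # FIFO queue gives breadth-first order
    visited = []
    while queue:
        url, depth = queue.popleft()
        try:
            page_source = fetch(url)
        except Exception:
            continue                   # skip pages that cannot be downloaded
        visited.append((url, depth))
        if depth + 1 >= max_depth:
            continue                   # links from this page would be too deep
        for link in extract_links(page_source):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

To crawl real pages, fetch could be a small wrapper around urllib.request.urlopen that reads and decodes the response. Passing a dictionary lookup instead, for example pages.__getitem__ on a dict of URL to HTML strings, makes it easy to check the visit order on made-up pages.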