Rahul Duggal duggalrahul

duggalrahul / Crawler
Created September 13, 2013 09:41
This is a basic Python-based web crawler. It uses breadth-first search to traverse web pages and a simple regular expression to extract http and https hyperlinks from each page's source. Written for Python 2.7.5. To use it, set the start_link variable to the URL of the page where crawling should begin.
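The breadth-first traversal and regex link extraction described above can be sketched without touching the network. This is a minimal illustration, not the gist's own code: `extract_links`, `crawl_bfs`, `fake_fetch`, `FAKE_WEB`, the `max_pages` limit, and the `*.example` URLs are all hypothetical names introduced here, and the fetch step is replaced by a lookup in an in-memory dictionary.

```python
import re
from collections import deque

# Same idea as the gist's regular expression: double-quoted http(s) URLs.
LINK_RE = re.compile(r'"(https?://[^"\s]+)"')


def extract_links(source):
    """Return the quoted http/https URLs found in an HTML string."""
    return LINK_RE.findall(source)


def crawl_bfs(start_link, fetch, max_pages=100):
    """Visit pages breadth-first from start_link; return URLs in visit order.

    fetch is a hypothetical callable that maps a URL to its HTML source
    and raises an exception when the page cannot be loaded.
    """
    seen = {start_link}          # URLs already queued, to avoid revisits
    queue = deque([start_link])  # FIFO queue gives breadth-first order
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        try:
            source = fetch(url)
        except Exception:
            continue             # unreachable page: skip its links
        for link in extract_links(source):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    'http://a.example/': '<a href="http://b.example/">b</a> <a href="https://c.example/">c</a>',
    'http://b.example/': '<a href="https://d.example/">d</a> <a href="http://a.example/">a</a>',
    'https://c.example/': '<a href="https://d.example/">d</a>',
}


def fake_fetch(url):
    return FAKE_WEB[url]  # raises KeyError for pages not in FAKE_WEB


order = crawl_bfs('http://a.example/', fake_fetch)
```

Passing the fetch function as a parameter keeps the traversal logic testable offline; a real crawler would supply a function that downloads the page instead of `fake_fetch`.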
import re
import urllib2
from sets import Set
start_link = 'http://precog.iiitd.edu.in/'
urls = Set([start_link])
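def findId(source):
    # Return every double-quoted http:// or https:// URL found in the page source.
    # [^"\s]+ stops at the closing quote, so adjacent quoted URLs are not merged.
    l = re.findall(r'"(https?://[^"\s]+)"', source)
    return l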