@MostAwesomeDude
Created January 29, 2017 23:23
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class SiegeSpider(scrapy.Spider):
    name = "siege"

    def __init__(self, domain, *args, **kwargs):
        super(SiegeSpider, self).__init__(*args, **kwargs)
        # Restrict extraction to links that stay on the target domain.
        urlBase = 'https://%s/' % domain
        self.le = LinkExtractor(allow=[urlBase])
        self.allowed_domains = [domain]
        self.start_urls = (urlBase,)

    def parse(self, response):
        # Emit each extracted URL as an item, then crawl it in turn;
        # Scrapy's built-in dupefilter keeps the crawl from looping.
        links = self.le.extract_links(response)
        for link in links:
            url = link.url
            yield {"url": url}
            yield scrapy.Request(url, callback=self.parse)
@MostAwesomeDude (Author)

This is a quick-and-dirty link harvester: it takes a single domain argument, crawls that domain, and yields every same-domain URL it finds.

An easy way to prepare a siege urls.txt for a domain:

$ scrapy runspider siege.py -a domain=matador.cloud -t json -o - 2> /dev/null | jq -r '.[] | .url'
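
To finish the job, the same pipeline can be redirected into urls.txt and handed to siege via its -f/--file option. A minimal sketch, assuming the urls.txt name and illustrative load settings (10 concurrent users for 30 seconds); the sort -u drops the duplicate URLs the spider can emit when several pages link to the same place:

$ scrapy runspider siege.py -a domain=matador.cloud -t json -o - 2> /dev/null \
    | jq -r '.[] | .url' | sort -u > urls.txt
$ siege -f urls.txt -c 10 -t 30S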
