@anapaulagomes
Created October 11, 2020 15:18
Find valid redirects (Scrapy)
import scrapy


class ImprensaOficialSpider(scrapy.Spider):
    start_urls = ["http://www.imprensaoficial.org/acesso.htm"]
    # example subdomain: http://pmameliarodriguesba.imprensaoficial.org/
    name = "imprensa_oficial"
    TERRITORY_ID = None
    # let 301 responses reach the callback instead of being followed
    handle_httpstatus_list = [301]

    def parse(self, response):
        # collect the city names listed in the page's <option> dropdown
        cities = response.css("option::attr(value)").extract()
        for city in cities:
            # build the candidate subdomain for each city and probe it
            yield scrapy.Request(
                f"http://pm{city.replace('.', '')}.imprensaoficial.org",
                callback=self.find_cities,
                dont_filter=True,
                meta={"city": city},
            )

    def find_cities(self, response):
        self.logger.info(f"------------------> {response.status} {response.url}")
        # a 301 here means the city subdomain exists and redirects somewhere valid
        if response.status == 301:
            yield {"url": response.url}