Last active
December 22, 2017 09:50
Simple scraper written in Python using Requests. The User-Agent is randomised per request, and requests are routed through a socket / Tor SOCKS proxy.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
"""
Simple Scraper written using Requests.
Randomised User Agent / request
Requests through a socket / Tor socks proxy.
"""
from fake_useragent import UserAgent
import socks
import socket
import requests
# import time
# From bash, start tor so that it runs in the background:
# $ tor &
# By default tor listens on port 9050; to be sure,
# check netstat to see whether the process is listening on that port:
# $ netstat -tupln
socks.setdefaultproxy(proxy_type=socks.PROXY_TYPE_SOCKS5, addr="127.0.0.1",
                      port=9050)
# Replace the standard socket class with socks.socksocket, so every
# connection the script opens from here on goes through the proxy.
socket.socket = socks.socksocket
start_id = 1  # you could use these variables to access sequential records;
stop_id = 5  # this is just an example of how to iterate (stop_id is exclusive)
ua = UserAgent()
# call ua.random inside the loop to randomise the User-Agent header per request
for record_id in range(start_id, stop_id):
    # start = time.time()
    headers = {
        'User-Agent': ua.random
    }
    # print(headers)
    outfile = "response.txt"
    url = "http://icanhazip.com"  # response.txt should contain the Tor IP address
    # "w" overwrites the file, so only the last response is kept
    with open(outfile, "w") as handle:
        handle.write(requests.get(url, headers=headers).text)
    # roundtrip = time.time() - start
    # print(roundtrip)
"""
Pre-flight checks:
1) Requests made over tor?
   watch encrypted traffic: sudo tcpdump -vv -x -X -i lo 'port 9050'
2) Check roundtrip time: (how many requests/s? Be polite / don't slam server)
3) Check ua randomisation is working?
4) After a few initial tests does response.txt have the kind of data we want?
"""
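Pre-flight checks 2) and 3) above can be sketched with the standard library alone. This is only a rough illustration, not part of the original script: `timed_call`, `polite_delay`, `distinct_user_agents`, `random_user_agent`, the `SAMPLE_USER_AGENTS` list and the one-second `min_interval` default are all hypothetical names and values chosen for the example. The sample list stands in for `ua.random`, so the sketch runs without `fake_useragent` or a network connection.

```python
import random
import time

# Hypothetical stand-in for fake_useragent's ua.random (made-up strings),
# so the sketch runs without third-party packages.
SAMPLE_USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) ExampleBrowser/3.0",
]


def random_user_agent():
    """Pick one User-Agent string at random from the sample list."""
    return random.choice(SAMPLE_USER_AGENTS)


def timed_call(fetch, *args, **kwargs):
    """Call fetch once and return (result, roundtrip_seconds)."""
    start = time.time()
    result = fetch(*args, **kwargs)
    return result, time.time() - start


def polite_delay(roundtrip, min_interval=1.0):
    """Seconds to sleep so that requests start at least min_interval apart."""
    return max(0.0, min_interval - roundtrip)


def distinct_user_agents(pick, samples=50):
    """Count how many different User-Agent strings pick() yields."""
    return len({pick() for _ in range(samples)})
```

Inside the scraper loop, one way to use this would be `response, roundtrip = timed_call(requests.get, url, headers=headers)` followed by `time.sleep(polite_delay(roundtrip))`. Calling `distinct_user_agents(lambda: ua.random)` before a run should return a number greater than 1 if randomisation is working.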