@samthetechie
Last active December 22, 2017 09:50
Simple scraper written in Python using Requests. Randomises the User-Agent header per request and sends requests through a Tor SOCKS proxy.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
"""
Simple scraper written using Requests.
Randomises the User-Agent header on every request.
Sends requests through a Tor SOCKS proxy by patching the socket module.
"""
from fake_useragent import UserAgent
import socks
import socket
import requests
# import time
# From bash, start tor so that it runs in the background:
# $ tor &
# By default tor listens on port 9050. To be sure, check netstat
# to see whether the process is listening on that port:
# $ netstat -tupln
socks.setdefaultproxy(proxy_type=socks.PROXY_TYPE_SOCKS5, addr="127.0.0.1",
                      port=9050)
# Replace socket.socket with socks.socksocket, the proxy-aware socket class.
# From here on, every socket the script opens (including those opened by
# Requests) connects through the SOCKS proxy configured above.
socket.socket = socks.socksocket
start_id = 1  # example bounds for iterating over sequential record ids
stop_id = 5  # exclusive: range(start_id, stop_id) yields 1 to 4
ua = UserAgent()
# call ua.random in the loop to randomise the user agent header per request
for record_id in range(start_id, stop_id):
    # start = time.time()
    headers = {
        'User-Agent': ua.random
    }
    # print headers
    outfile = "response.txt"
    url = "http://icanhazip.com"  # response.txt should contain the Tor exit IP address
    with open(outfile, "w") as fh:
        fh.write(requests.get(url, headers=headers).text)
    # roundtrip = time.time() - start
    # print roundtrip
"""
Pre-flight checks:
1) Are the requests made over Tor?
   Watch the traffic to the local Tor SOCKS port: sudo tcpdump -vv -x -X -i lo 'port 9050'
2) Check the roundtrip time (how many requests/s? Be polite, don't slam the server).
3) Is the User-Agent randomisation working?
4) After a few initial tests, does response.txt contain the kind of data we want?
"""
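Some of the pre-flight checks above can be scripted. The following is a minimal sketch using only the standard library, not part of the original script: the helper names `port_is_listening`, `timed` and `count_distinct` are hypothetical, and the host and port defaults simply repeat the values used above. It assumes it is run on its own, before `socket.socket` has been replaced with `socks.socksocket`.

```python
import socket
import time


def port_is_listening(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Connection refused, timed out, or host unreachable.
        return False
    conn.close()
    return True


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start


def count_distinct(make_value, samples=20):
    """Call make_value() `samples` times and count the distinct results."""
    return len(set(make_value() for _ in range(samples)))
```

With the script's own objects, `count_distinct(lambda: ua.random)` would give a rough idea of whether the User-Agent actually varies, and `timed(requests.get, url, headers=headers)` would return the response together with its roundtrip time. Both calls are illustrative and depend on `ua`, `url` and `headers` being defined as in the script.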