I thought I'd just share how I'm getting socks support with scrapy. Basically there are two pretty good options, DeleGate and Privoxy. I'm going to give an example of a middleware that I implemented using DeleGate which has worked great for me thus far.
DeleGate is amazingly simple and straightforward; it's basically serving as an http-to-socks bridge. In other words, you make a request to it with scrapy as if it were an http proxy and it will take care of bridging that over to the socks server. Privoxy can do this too, but it seems like DeleGate has much better documentation and possibly more functionality than Privoxy (maybe...). You can either build from source or download a pre-built binary (supports Windows, Mac OS X, Linux, BSD, and Solaris). Set it up however you like so that it's on your PATH. In my Ubuntu setup I simply created a symbolic link to the binary in my /usr/bin directory. Copying it over there works too. So after it's installed, try running this in your shell:
delegated ADMIN=<whatever-you-want> RESOLV="" -Plocalhost:<localport> SERVER=http SOCKS=<socks-server-address>:<socks-server-port>
This should set up a proxy server on the local machine. A brief explanation of some of the options:
ADMIN
- this can be whatever. Ideally it should be an email address to display should the DeleGate server run into a problem.
RESOLV
- this controls DNS resolution. Setting it to an empty string keeps DeleGate from resolving hostnames itself; without this argument, I noticed I was inadvertently exposing my IP while testing against my dev server. (You may or may not need this. I suspect I needed it because I have a public DNS A record pointing to the particular machine I was testing DeleGate on)
-P[localhost]:localport
- the address and port of the local DeleGate proxy server which will run. You can just set an arbitrary port.
SERVER
- the protocol of the local DeleGate proxy server. In this case, we want HTTP because that's what scrapy is compatible with
SOCKS
- the address and port of the socks proxy server that DeleGate will "bridge" the request to.
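To make the mapping between these options and the final command concrete, here's a small Python sketch that assembles the same invocation as an argument list (the function name and default values are my own, not part of DeleGate):

```python
# Hypothetical helper: builds the delegated command shown above as an
# argument list (an argv list avoids shell quoting issues).
def build_delegate_cmd(localport, socks_host, socks_port,
                       admin='nobody@example.com'):
    return [
        'delegated',
        'ADMIN=%s' % admin,                         # contact shown on errors
        'RESOLV=',                                  # disable local DNS resolution
        '-Plocalhost:%d' % localport,               # local HTTP proxy endpoint
        'SERVER=http',                              # speak HTTP toward scrapy
        'SOCKS=%s:%d' % (socks_host, socks_port),   # upstream SOCKS server
    ]

cmd = build_delegate_cmd(8118, '192.0.2.10', 1080)
```

You could pass that list straight to subprocess.Popen without shell=True.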
To shut down gracefully, you can run this command in a separate window:
delegated -P[localhost]:<localport> -Fkill
Keep in mind that this is setting up a live proxy server running on localhost. While testing I was able to access the DeleGate web interface through my browser. Make sure your firewall is set up accordingly, or read the docs on setting up auth/security, unless you want people like me finding it and using it.
So to make this integrate nicely with scrapy, I wrote a middleware. Here's a watered down version of it:
from my_scrapy_project.util.proxy_manager import Proxy, ProxyManager
import subprocess


class CustomProxyMiddleware(object):

    @staticmethod
    def start_delegate(proxy, localport):
        # launch a DeleGate instance that bridges HTTP -> SOCKS
        cmd = 'delegated ADMIN=nobody RESOLV="" -P:%s SERVER=http TIMEOUT=con:15 SOCKS=%s:%s' % (localport, proxy.address, proxy.port)
        subprocess.Popen(cmd, shell=True)
        # point the proxy object at the local bridge
        proxy.address = 'localhost'
        proxy.scheme = 'http'
        proxy.port = localport
        return proxy

    @staticmethod
    def stop_delegate(localport):
        cmd = 'delegated -P:%s -Fkill' % localport
        subprocess.Popen(cmd, shell=True)
        ProxyManager.release_delegate_port(localport)

    def process_request(self, request, spider):
        # For simplicity I'm not including code for Proxy or ProxyManager. Should be self explanatory.
        proxy = Proxy(ProxyManager.get_socks_proxy_params())
        localport = ProxyManager.reserve_delegate_port()
        socks_bridge_proxy = CustomProxyMiddleware.start_delegate(proxy, localport)
        request.meta['proxy'] = socks_bridge_proxy.to_string()
        request.meta['delegate_port'] = localport

    def process_response(self, request, response, spider):
        # handle response logic here
        # check if there is a delegate instance running for this request
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])
        return response

    def process_exception(self, request, exception, spider):
        # handle exceptions here; also shut down the delegate
        # instance so it doesn't keep running after a failed request
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])
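Since I left Proxy and ProxyManager out above, here's a bare-bones sketch of what mine roughly look like. All of the details here (port range, the hard-coded SOCKS server) are hypothetical placeholders; yours will differ:

```python
class Proxy(object):
    # minimal proxy container; params is a dict like
    # {'address': '192.0.2.10', 'port': 1080, 'scheme': 'socks5'}
    def __init__(self, params):
        self.address = params['address']
        self.port = params['port']
        self.scheme = params.get('scheme', 'http')

    def to_string(self):
        # scrapy expects request.meta['proxy'] in URL form
        return '%s://%s:%s' % (self.scheme, self.address, self.port)


class ProxyManager(object):
    # pool of local ports that DeleGate instances may bind to
    _free_ports = list(range(8100, 8110))

    @classmethod
    def reserve_delegate_port(cls):
        return cls._free_ports.pop()

    @classmethod
    def release_delegate_port(cls, port):
        cls._free_ports.append(port)

    @classmethod
    def get_socks_proxy_params(cls):
        # in reality this would rotate through your list of SOCKS servers
        return {'address': '192.0.2.10', 'port': 1080, 'scheme': 'socks5'}
```

Don't forget to enable the middleware in your project's settings.py under DOWNLOADER_MIDDLEWARES, or scrapy won't call it.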