@dchrostowski
Created June 13, 2017 05:29
socks proxy middleware example for scrapy with DeleGate

I thought I'd just share how I'm getting SOCKS support with Scrapy. There are basically two pretty good options: DeleGate and Privoxy. Below is an example of a middleware I implemented using DeleGate, which has worked great for me so far.

DeleGate is amazingly simple and straightforward; here it serves as an HTTP-to-SOCKS bridge. In other words, Scrapy makes a request to it as if it were an HTTP proxy, and DeleGate takes care of relaying that request to the SOCKS server. Privoxy can do this too, but DeleGate seems to have much better documentation and possibly more functionality (maybe...).

You can either build DeleGate from source or download a pre-built binary (Windows, Mac OS X, Linux, BSD, and Solaris are supported). Set it up however you like so that it's on your PATH. On my Ubuntu setup I simply created a symbolic link to the binary in /usr/bin; copying it there works too. Once it's installed, try running this in your shell:

delegated ADMIN=<whatever-you-want> RESOLV="" -Plocalhost:<localport> SERVER=http SOCKS=<socks-server-address>:<socks-server-port>

This should set up a proxy server on the local machine. A brief explanation of the options:

ADMIN - this can be whatever. Ideally it should be an email address to display if the DeleGate server runs into a problem.

RESOLV - I forget exactly what this does; it has something to do with DNS resolution. When I left this argument out instead of setting it to an empty string, I noticed I was inadvertently exposing my IP while testing against my dev server. (You may or may not need this. I suspect I needed it because I have a public DNS A record pointing to the machine I was testing DeleGate on.)

-P[localhost]:localport - the address and port the local DeleGate proxy server will listen on. The port can be any free port.

SERVER - the protocol of the local DeleGate proxy server. In this case we want HTTP because that's what Scrapy is compatible with.

SOCKS - the address and port of the SOCKS proxy server that DeleGate will "bridge" the request to.
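If you end up launching DeleGate from a script, it can help to assemble the options above as an argument list instead of a formatted shell string. This is only a minimal sketch: the function name `build_delegate_cmd` and the example host and port values are my own placeholders, and it uses only the options shown in the command above.

```python
import shlex


def build_delegate_cmd(local_port, socks_host, socks_port, admin="nobody"):
    """Return the delegated argument list for an HTTP-to-SOCKS bridge.

    Hypothetical helper: only the options from the shell example are used.
    """
    return [
        "delegated",
        "ADMIN=%s" % admin,
        "RESOLV=",  # empty value, the same thing RESOLV="" gives you in a shell
        "-Plocalhost:%d" % local_port,
        "SERVER=http",
        "SOCKS=%s:%d" % (socks_host, socks_port),
    ]


# Example values are placeholders, not a real SOCKS server.
cmd = build_delegate_cmd(8118, "10.0.0.5", 1080)
print(" ".join(shlex.quote(part) for part in cmd))
```

A list like this can be handed to `subprocess.Popen(cmd)` without `shell=True`, so a proxy address containing odd characters is never interpreted by the shell.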

To shut down gracefully, you can run this command in a separate window:

delegated -P[localhost]:<localport> -Fkill

Keep in mind that this sets up a live proxy server running on localhost. While testing I was able to access the DeleGate web interface through my browser. Make sure that your firewall is set up accordingly, or read the docs on setting up auth/security, unless you want people like me finding it and using it.
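Since the bridge is a real server listening on a local port, a script that starts it may want to confirm the port is accepting connections before sending any traffic through it. Here is a rough sketch using only the Python standard library; `wait_for_port` and its default timeout values are my own assumptions, not something DeleGate provides.

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout=5.0, interval=0.1):
    """Poll until something accepts TCP connections on host:port.

    Returns True once a connection succeeds, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the local proxy is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); wait briefly and retry.
            time.sleep(interval)
    return False
```

Calling something like `wait_for_port(localport)` right after launching `delegated` would avoid racing the first request against the server's startup.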

To make this integrate nicely with Scrapy, I wrote a middleware. Here's a watered-down version of it:

from my_scrapy_project.util.proxy_manager import Proxy, ProxyManager
import subprocess

class CustomProxyMiddleware(object):
    
    @staticmethod
    def start_delegate(proxy, localport):
        cmd = 'delegated ADMIN=nobody RESOLV="" -P:%s SERVER=http TIMEOUT=con:15 SOCKS=%s:%s' % (localport, proxy.address, proxy.port)
        subprocess.Popen(cmd, shell=True)
        proxy.address = 'localhost'
        proxy.scheme = 'http'
        proxy.port = localport

        return proxy

    @staticmethod
    def stop_delegate(localport):
        cmd = 'delegated -P:%s -Fkill' % localport
        subprocess.Popen(cmd, shell=True)
        ProxyManager.release_delegate_port(localport)
        
    def process_request(self, request, spider):
        # For simplicity I'm not including code for Proxy or ProxyManager.  Should be self explanatory.
        proxy = Proxy(ProxyManager.get_socks_proxy_params())
        localport = ProxyManager.reserve_delegate_port()
        socks_bridge_proxy = CustomProxyMiddleware.start_delegate(proxy,localport)
        request.meta['proxy'] = socks_bridge_proxy.to_string()
        request.meta['delegate_port'] = localport

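    def process_response(self, request, response, spider):
        # handle response logic here

        # check if there is a delegate instance running for this request
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])

        # Scrapy requires process_response to return a Response (or Request)
        return response

    def process_exception(self, request, exception, spider):
        # handle exceptions here; also stop the delegate instance so a failed
        # request does not leave it running
        if 'delegate_port' in request.meta:
            CustomProxyMiddleware.stop_delegate(request.meta['delegate_port'])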
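The middleware leaves out `Proxy` and `ProxyManager`. For anyone who wants something to start from, here is one way they might look. To be clear, this is a guess at their shape based only on how the middleware calls them: the dict layout, the port range, and the fixed proxy parameters are all placeholders rather than the real implementation.

```python
class Proxy(object):
    """Hypothetical stand-in for the Proxy class the middleware imports."""

    def __init__(self, params):
        # params is assumed to be a dict such as
        # {'scheme': 'socks5', 'address': '10.0.0.5', 'port': 1080}
        self.scheme = params['scheme']
        self.address = params['address']
        self.port = params['port']

    def to_string(self):
        # request.meta['proxy'] takes a proxy URL like http://host:port
        return '%s://%s:%s' % (self.scheme, self.address, self.port)


class ProxyManager(object):
    """Hypothetical bookkeeping for the local ports DeleGate listens on."""

    _free_ports = set(range(8100, 8110))  # arbitrary pool of local ports
    _used_ports = set()

    @staticmethod
    def get_socks_proxy_params():
        # Placeholder: a real version would pick a SOCKS proxy from a pool.
        return {'scheme': 'socks5', 'address': '127.0.0.1', 'port': 1080}

    @classmethod
    def reserve_delegate_port(cls):
        if not cls._free_ports:
            raise RuntimeError('no free DeleGate ports left')
        port = cls._free_ports.pop()
        cls._used_ports.add(port)
        return port

    @classmethod
    def release_delegate_port(cls, port):
        # Return the port to the pool once its delegate has been stopped.
        cls._used_ports.discard(port)
        cls._free_ports.add(port)
```

With ten ports in the pool, at most ten DeleGate instances can run at once, so the pool size should be at least as large as the number of concurrent requests Scrapy is allowed to make.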