@cydu
Created June 15, 2014 03:26

# Scrapy project settings
DOWNLOAD_HANDLERS = {
    'http': 'myspider.socks5_http.Socks5DownloadHandler',
    'https': 'myspider.socks5_http.Socks5DownloadHandler'
}

# socks5_http.py
from txsocksx.http import SOCKS5Agent
from twisted.internet import reactor
from scrapy.xlib.tx import TCP4ClientEndpoint
from scrapy.core.downloader.webclient import _parse
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent


class Socks5DownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapySocks5Agent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapySocks5Agent(ScrapyAgent):

    def _get_agent(self, request, timeout):
        bindAddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            _, _, host, port, proxyParams = _parse(request.url)
            proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                                               timeout=timeout, bindAddress=bindAddress)
            agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
            return agent

        return self._Agent(reactor, contextFactory=self._contextFactory,
                           connectTimeout=timeout, bindAddress=bindAddress, pool=self._pool)
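
The handler above takes the proxy address from request.meta['proxy'] and splits it into a host and a port before opening the TCP endpoint. As a rough illustration of that split, here is a minimal standard-library sketch in Python 3. It is a stand-in, not Scrapy's private `_parse` helper: the function name `split_proxy_url`, the example URLs, and the fallback to port 1080 are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def split_proxy_url(proxy_url, default_port=1080):
    """Hypothetical helper: return (host, port) for a proxy URL.

    This only illustrates the host/port split; it is not Scrapy's _parse.
    The default port of 1080 is an assumption for URLs that omit a port.
    """
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        # Without a "scheme://" prefix urlsplit finds no host at all.
        raise ValueError("proxy URL has no host: %r" % (proxy_url,))
    return parts.hostname, parts.port or default_port


# Example proxy URL (an assumption, e.g. a locally running SOCKS5 proxy).
host, port = split_proxy_url("socks5://127.0.0.1:9050")
```

A value of this shape would be set per request as request.meta['proxy'], which is the key the handler reads.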

bufrr commented Sep 10, 2015

I used your code and ran into a problem.
Scrapy's output is:

File "/home/adam/.virtualenvs/scrapyenv/local/lib/python2.7/site-packages/ometa/interp.py", line 499, in err
    raise e
ParseError: 
<html>
^
Parse error at line 1, column 0: expected the character '\x05'. trail: []

Do you have any idea how I can fix it?
Thank you very much!
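
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: under RFC 1928 a SOCKS5 server answers the client's greeting with two bytes, a version byte 0x05 followed by the chosen method. The parser expected '\x05' but received "<html>", which suggests the address in request.meta['proxy'] answered with HTTP instead of SOCKS5 (for example an HTTP proxy or a web server). The helper name `looks_like_socks5_reply` below is hypothetical and only illustrates the check.

```python
SOCKS5_VERSION = 0x05


def looks_like_socks5_reply(data):
    """Return True if `data` starts like a SOCKS5 method-selection reply.

    RFC 1928: the reply is two bytes, VER (0x05) followed by METHOD.
    """
    return len(data) >= 2 and data[0] == SOCKS5_VERSION


# A reply that selects "no authentication" versus the HTML seen in the traceback.
socks_reply_ok = looks_like_socks5_reply(b"\x05\x00")
html_reply_ok = looks_like_socks5_reply(b"<html>")
```

If the first bytes from the proxy fail this check, it may be worth confirming that the configured host and port really point at a SOCKS5 server.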

@davidblus

First of all, thanks for your code.
It works for http requests, but it doesn't work for https.
I traced the steps Scrapy takes when it fetches an https site.
In the end, the code can be fixed by passing contextFactory=ScrapyClientContextFactory() to SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint). After that, https requests work too.

socks5_http.py
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

line 24
agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint, contextFactory=ScrapyClientContextFactory())

By the way, this does not work for SOCKS4. Do not use SOCKS4Agent until some bugs in txsocksx are fixed.
