@mattfysh
Last active July 18, 2021 23:15
Luminati Performance
PROXY_URL=<your_proxy_url>

echo "HTTP"
time (
  curl --proxy "$PROXY_URL" -s -o /dev/null http://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null http://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null http://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null http://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null http://example.com
)

echo "HTTP with connection reuse"
time (
  # a single curl process: -: (--next) separates per-URL options,
  # and the established connection is reused across the requests
  curl --proxy "$PROXY_URL" -s -o /dev/null http://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null http://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null http://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null http://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null http://example.com
)

echo "HTTPS"
time (
  curl --proxy "$PROXY_URL" -s -o /dev/null https://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null https://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null https://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null https://example.com
  curl --proxy "$PROXY_URL" -s -o /dev/null https://example.com
)

echo "HTTPS with connection reuse"
time (
  curl --proxy "$PROXY_URL" -s -o /dev/null https://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null https://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null https://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null https://example.com \
    -: --proxy "$PROXY_URL" -s -o /dev/null https://example.com
)
import http from 'http'
import proxy from './proxy'

// http.Agent that forwards plain-HTTP requests through the Luminati superproxy
export default class HttpProxyAgent extends http.Agent {
  addRequest(req, options) {
    // Authenticate with the proxy and ask it to keep the connection open
    req.setHeader(
      'Proxy-Authorization',
      `Basic ${Buffer.from(proxy.auth).toString('base64')}`
    )
    req.setHeader('Proxy-Connection', 'Keep-Alive')

    // A forwarding proxy expects an absolute URI in the request line
    req.path = `http://${options.host}${req.path}`

    // Open the socket to the superproxy instead of the target host
    const forwardingOptions = {
      ...options,
      host: 'zproxy.lum-superproxy.io',
      port: 22225,
    }
    return super.addRequest(req, forwardingOptions)
  }
}
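A minimal usage sketch for the agent above. It assumes the class is saved as ./http-proxy-agent and that ./proxy exports an auth string of the form user:password (neither the file names nor the auth shape are shown in the gist, so adjust to your layout). The keepAlive/keepAliveMsecs options are the ones referenced in the notes below:

import http from 'http'
import HttpProxyAgent from './http-proxy-agent'

// One shared agent = one shared socket pool across all plain-HTTP scrapes
const agent = new HttpProxyAgent({
  keepAlive: true,      // keep idle sockets to the superproxy open for reuse
  keepAliveMsecs: 5000, // initial delay for TCP keepalive probes on those sockets
})

http.get('http://example.com/', { agent }, res => {
  res.resume() // drain the body so the socket is released back to the pool
  console.log(res.statusCode)
})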

mattfysh commented Jul 18, 2021

"Target domain" = the origin of the website you're scraping

  • Instrument your code and measure performance before making changes to your networking layer, so you can see what impact each change has made. I use AWS X-Ray; another good option is New Relic. You're looking for "APM solutions with tracing" (a minimal X-Ray setup is sketched just after this list)
  • The biggest gain will come from reusing connections as much as possible. This avoids costly routing/handshakes/etc that occur when establishing a new connection
  • HTTPS proxying works using tunnelling: the proxy opens a raw TCP tunnel to the target via an HTTP CONNECT request, then relays the encrypted bytes without terminating TLS (a sketch of this handshake also follows the list)
    • Connections can only be reused for the same target domain, e.g. a connection (socket) used to scrape https://a.com cannot be reused to scrape https://b.com
    • You're at the mercy of the target domain: its server may not have keepalive enabled at all, and it ultimately decides how long it keeps a TCP connection alive
  • HTTP proxying works using forwarding
    • The connection is terminated at the proxy and kept open as long as possible
    • In contrast to HTTPS, the connection pool can be reused for any target domain
  • For understanding's sake, you can add the --verbose flag to the curl commands above to see the client/server messaging
  • Examine the x-luminati-timeline response header to understand where time is being spent on your remote proxies (the verification sketch under the Node.js notes below logs it). You could experiment with different settings to see if this can be improved, e.g. selecting a proxy region that is physically closer to the target domain
  • Try to initiate the connection process ASAP on an incoming request. e.g. if you're scraping https://twitter.com, begin opening a connection straight away (if a reusable one doesn't exist), while other pre-scraping tasks are being performed, e.g. Auth, database requests
  • With APM tracing in place, you'll see other areas in your process that could be improved, e.g.
    • Batch your database reads as much as possible, and implement caching to avoid hits on the database. Be careful about which TTL you use for your cache entries so that you don't have to wait too long for the cache to update after making a change (unless you want to do custom cache invalidation, which may be more trouble than it's worth)
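To illustrate the instrumentation point above, here is a minimal sketch of tracing outgoing requests with the aws-xray-sdk-core package (my choice of package here is an assumption; any APM with tracing works similarly). Subsegments are only recorded while an X-Ray segment is open, e.g. inside Lambda or behind the Express middleware:

import http from 'http'
import https from 'https'
import AWSXRay from 'aws-xray-sdk-core'

// Patch the core http/https modules so every outgoing request (including
// those sent through the proxy agents) is recorded as a subsegment with
// its duration and response status.
AWSXRay.captureHTTPsGlobal(http)
AWSXRay.captureHTTPsGlobal(https)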

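And to make the tunnelling point concrete, a rough sketch of what an HTTPS proxy agent does under the hood. This is illustrative rather than the gist's code; it reuses the ./proxy auth import and the superproxy host from the agent above:

import http from 'http'
import tls from 'tls'
import proxy from './proxy'

// Step 1: ask the superproxy to open a raw TCP tunnel to the target
const connectReq = http.request({
  host: 'zproxy.lum-superproxy.io',
  port: 22225,
  method: 'CONNECT',
  path: 'example.com:443',
  headers: {
    'Proxy-Authorization': `Basic ${Buffer.from(proxy.auth).toString('base64')}`,
  },
})

connectReq.on('connect', (res, socket) => {
  // Step 2: negotiate TLS end-to-end with the target over the tunnel.
  // Because the TLS session is bound to example.com, this socket can only
  // ever be reused for example.com, which is the per-domain limitation above.
  const tlsSocket = tls.connect({ socket, servername: 'example.com' }, () => {
    tlsSocket.write('GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n')
  })
  tlsSocket.on('data', chunk => process.stdout.write(chunk))
})

connectReq.end()
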
Node.js developers

  • Use https://github.com/mknj/node-keepalive-proxy-agent for an HTTPS agent
  • Use the custom code above for an HTTP agent
  • Don't forget to create the agent with keepAlive=true and keepAliveMsecs=5000: keepAlive keeps idle sockets in the agent's pool for reuse (separate from the Connection: keep-alive header), and keepAliveMsecs sets the initial delay for TCP keepalive probes on those sockets
  • Run your code with NODE_DEBUG=net to verify that your connections are actually being reused. Look for the "createConnection" log line
  • You can also check the req.reusedSocket flag to see if the request was sent over an existing connection
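For example, a small verification sketch tying these together (same assumed file layout as the usage sketch above): run it with NODE_DEBUG=net, fire two sequential requests through the agent, and the second should log reusedSocket: true. It also prints the x-luminati-timeline header mentioned earlier:

import http from 'http'
import HttpProxyAgent from './http-proxy-agent'

const agent = new HttpProxyAgent({ keepAlive: true, keepAliveMsecs: 5000 })

const fetchOnce = () =>
  new Promise((resolve, reject) => {
    const req = http.get('http://example.com/', { agent }, res => {
      res.resume() // drain the body so the socket returns to the pool
      res.on('end', () =>
        resolve({
          reusedSocket: req.reusedSocket, // true if sent over an existing connection
          timeline: res.headers['x-luminati-timeline'], // where the proxy spent its time
        })
      )
    })
    req.on('error', reject)
  })

const run = async () => {
  console.log(await fetchOnce()) // first request: expect reusedSocket: false
  console.log(await fetchOnce()) // second request: expect reusedSocket: true
}
run()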

Todo

  • Does using proxy-level sessions improve performance, i.e. adding -session-<id> to your proxy username, as documented here: https://brightdata.com/faq#integration-persistence (a sketch of the username format follows this list)
  • What techniques can be used to keep connections open for as long as possible, e.g. TCP keepalive versus issuing HTTP requests for /favicon.ico at 45-second intervals? And how do we determine whether such activity lets the target detect abnormal behaviour?
  • Look into IP caching for zproxy.lum-superproxy.io
  • In forwarding mode, does the proxy server also keep alive connections to outbound target domains?
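On the first Todo item, a sketch of what proxy-level sessions look like in practice, going by the linked FAQ: the session id is appended to the proxy username so that requests sharing the id exit through the same peer IP. The customer/zone/password values here are placeholders, not anything from this gist:

// Hypothetical helper: build a proxy URL whose username carries a session id.
// Username format per https://brightdata.com/faq#integration-persistence
const buildProxyUrl = ({ customer, zone, password, sessionId }) =>
  `http://lum-customer-${customer}-zone-${zone}-session-${sessionId}:${password}` +
  '@zproxy.lum-superproxy.io:22225'

// e.g. use the result as PROXY_URL in the curl benchmarks above
console.log(buildProxyUrl({ customer: 'c_12345', zone: 'static', password: 'secret', sessionId: 'rand83k2' }))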
