@benoit74
Last active January 1, 2024 11:28
Observe Browsertrix Crawler web traffic
version: '3.5'
services:
  crawler:
    image: webrecorder/browsertrix-crawler:latest
    volumes:
      - ./output:/output
    environment:
      - PROXY_HOST=web_proxy
      - PROXY_PORT=8080
    command:
      - crawl
      - ... add your parameters here ...
      - --cwd
      - /output/crawls
      - --statsFilename
      - /output/crawl.json
      - --screencastPort
      - "9037"
    ports:
      - 127.0.0.1:9037:9037
  web_proxy:
    image: mitmproxy/mitmproxy:latest
    command:
      - mitmdump
      - -w
      - /output/mitmdump/dump-%Y-%m-%d-%H-%M
      - --flow-detail
      - "1"
      - --listen-host
      - "0.0.0.0"
    volumes:
      - ./output:/output

You may need a better grasp of what traffic Browsertrix generates and where it is being blocked.

You can use the docker compose stack above to run:

  • a mitmproxy web proxy (running mitmdump) to intercept all Browsertrix traffic and dump it to file
  • a screencasting Browsertrix crawler to observe crawler behavior

If you open http://localhost:9037, you will see the Browsertrix screencast of the browser.

The mitmdump output files are written to the output/mitmdump folder, with one file per minute.
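To pick which dump file to replay, it can help to recover the minute each file covers from its name. The following is a minimal sketch, assuming the files keep the dump-%Y-%m-%d-%H-%M naming used in the compose file. The `dump_minute` and `sorted_dumps` helpers are hypothetical names for illustration, not part of mitmproxy.

```python
from datetime import datetime
from pathlib import Path

# Assumed to match the -w pattern passed to mitmdump in the compose file.
DUMP_NAME_FORMAT = "dump-%Y-%m-%d-%H-%M"


def dump_minute(path):
    """Return the minute encoded in a dump file name as a datetime."""
    return datetime.strptime(Path(path).name, DUMP_NAME_FORMAT)


def sorted_dumps(folder="output/mitmdump"):
    """List dump files in chronological order, skipping unrelated files."""
    dumps = []
    for path in Path(folder).iterdir():
        try:
            dumps.append((dump_minute(path), path))
        except ValueError:
            continue  # name does not follow the expected pattern
    return [path for _, path in sorted(dumps)]
```

Calling `sorted_dumps()` from the directory that contains the output folder should return the dump paths from oldest to newest, which can then be passed one by one to mitmdump -n -r.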

A sample script, extract_retry_after.py, processes a mitmdump file, finds the responses with an HTTP 429 status code, and prints their Retry-After header. It can be launched with mitmdump -n -r output/mitmdump/dump-xxxx-xx-xx-xx-xx -s extract_retry_after.py --flow-detail 0

from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    """Print the Retry-After header of every HTTP 429 response."""
    if not flow.response:
        return
    if flow.response.status_code != 429:
        return
    if "Retry-After" not in flow.response.headers:
        return
    print(flow.response.headers["Retry-After"])


# def request(flow: http.HTTPFlow) -> None:
#     request_content = flow.request.content
#     # here you get the request content and can then log it and use it
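
The script above prints the raw Retry-After value, which per the HTTP specification may be either a number of seconds or an HTTP date. To compare values across responses, a small helper can normalize both forms to a delay in seconds. This is a sketch using only the Python standard library; `retry_after_seconds` is a hypothetical name and is not part of mitmproxy or Browsertrix.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to a delay in seconds.

    Returns None when the value is neither a number of seconds
    nor a parseable HTTP date.
    """
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Treat a date without an explicit offset as UTC.
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (target - now).total_seconds())
```

In the addon, the printed value could be passed through this helper first. When replaying an old dump, the date form is then relative to the time of replay rather than the time of the crawl, unless a suitable `now` is supplied.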