@robertknight
Last active July 8, 2018 22:31
PyWB live rewriting notes

pywb implementation notes

Server-side components

  • Static server-side rewriting implemented by classes in pywb/rewrite/*

    html_rewriter.py

    • Takes HTML as input via the feed() method. The handle_*() methods are called as the HTML is parsed; they rewrite tag attributes, and the rewritten markup is then output.
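
    To make the feed()/callback flow concrete, here is a minimal sketch built on Python's standard html.parser.HTMLParser (whose callbacks are named handle_starttag(), handle_data() and so on). The class name, the PROXY_PREFIX value and the set of rewritten attributes are assumptions for illustration, not pywb's actual code.

    ```python
    from html import escape
    from html.parser import HTMLParser

    PROXY_PREFIX = "https://proxy.example/"  # hypothetical proxy prefix


    class SketchHTMLRewriter(HTMLParser):
        """Re-emits parsed HTML, prefixing absolute URLs in selected attributes."""

        URL_ATTRS = {"href", "src", "action"}  # assumed subset of URL attributes

        def __init__(self):
            # Keep entity references as-is so they can be re-emitted unchanged.
            super().__init__(convert_charrefs=False)
            self.out = []

        def handle_starttag(self, tag, attrs):
            parts = [tag]
            for name, value in attrs:
                if value is None:  # boolean attribute such as "disabled"
                    parts.append(name)
                    continue
                if name in self.URL_ATTRS and value.startswith(("http://", "https://")):
                    value = PROXY_PREFIX + value
                # The parser unescapes attribute values, so escape them again.
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
            self.out.append("<%s>" % " ".join(parts))

        def handle_endtag(self, tag):
            self.out.append("</%s>" % tag)

        def handle_data(self, data):
            self.out.append(data)

        def handle_entityref(self, name):
            self.out.append("&%s;" % name)

        def handle_charref(self, name):
            self.out.append("&#%s;" % name)

        def handle_comment(self, data):
            self.out.append("<!--%s-->" % data)

        def handle_decl(self, decl):
            self.out.append("<!%s>" % decl)


    def rewrite_html(html):
        rewriter = SketchHTMLRewriter()
        rewriter.feed(html)
        rewriter.close()
        return "".join(rewriter.out)
    ```

    Relative URLs are left alone here; a real rewriter would resolve them against the page URL first.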

    url_rewriter.py

    • UrlRewriter is constructed with a set of config rules. Its rewrite() method takes an input URL and rewrites it according to those rules.
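
    A rough sketch of the "construct with config, then call rewrite()" shape follows. The class name, constructor arguments and the list of schemes that are left untouched are hypothetical; pywb's real UrlRewriter has a different constructor and more rules.

    ```python
    from urllib.parse import urljoin


    class SketchUrlRewriter:
        """Hypothetical URL rewriter: resolves a URL against the page, then prefixes it."""

        # Assumed set of URL forms that should never be routed through the proxy.
        NO_REWRITE_PREFIXES = ("#", "javascript:", "data:", "mailto:", "about:")

        def __init__(self, proxy_prefix, page_url):
            self.proxy_prefix = proxy_prefix
            self.page_url = page_url

        def rewrite(self, url):
            if not url or url.startswith(self.NO_REWRITE_PREFIXES):
                return url
            if url.startswith(self.proxy_prefix):
                return url  # already rewritten
            # Resolve relative and scheme-relative URLs against the original page URL.
            absolute = urljoin(self.page_url, url)
            return self.proxy_prefix + absolute


    rw = SketchUrlRewriter("https://proxy.example/", "http://example.com/a/b.html")
    print(rw.rewrite("c.png"))
    ```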

    regex_rewriter.py

    • Rewrites accesses of variables named "location" or "domain" to a proxy document.WB_wombat_location object in scripts
    • Rewrites JS links
    • Rewrites @import and url() in CSS to proxied locations. Done with a regex and consequently limited
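
    The CSS case, and why a regex is limiting, can be illustrated with a small sketch. The patterns and PROXY_PREFIX below are simplified assumptions, not the expressions pywb uses.

    ```python
    import re

    PROXY_PREFIX = "https://proxy.example/"  # hypothetical proxy prefix

    # Simplified patterns: absolute URLs only, optional quotes around url(...).
    CSS_URL = re.compile(r"""url\(\s*(['"]?)(https?://[^'")\s]+)\1\s*\)""", re.IGNORECASE)
    CSS_IMPORT = re.compile(r"""(@import\s+)(['"])(https?://[^'"]+)\2""", re.IGNORECASE)


    def rewrite_css(css):
        css = CSS_URL.sub(
            lambda m: "url(%s%s%s%s)" % (m.group(1), PROXY_PREFIX, m.group(2), m.group(1)),
            css,
        )
        css = CSS_IMPORT.sub(
            lambda m: m.group(1) + m.group(2) + PROXY_PREFIX + m.group(3) + m.group(2),
            css,
        )
        return css


    print(rewrite_css('body { background: url("http://example.com/bg.png"); }'))
    # The regex has no notion of CSS syntax, so text inside a comment is rewritten too:
    print(rewrite_css("/* url(http://example.com/old.png) */"))
    ```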

    cookie_rewriter.py

    header_rewriter.py

    • Rewrites cookie headers using cookie_rewriter.py
    • Removes security headers that may interfere with loading of page content at a different URL (eg. Content-Security-Policy)
    • Rewrites URLs in redirect headers (eg. Location)
    • Replaces cache headers with new ones that depend on the caching settings that the rewriter is configured with. The default policy is to disable caching
    • Modifies headers for content length and encoding to account for any changes made by the proxy
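
    The header handling listed above could look roughly like the sketch below. The header sets are assumptions (the notes only name Content-Security-Policy and Location), cookie rewriting is left out, and content length/encoding headers are simply dropped on the assumption that the caller sets fresh ones after rewriting the body.

    ```python
    PROXY_PREFIX = "https://proxy.example/"  # hypothetical proxy prefix

    # Assumed header groups; only Content-Security-Policy and Location come from the notes.
    SECURITY_HEADERS = {"content-security-policy", "x-frame-options"}
    URL_HEADERS = {"location", "content-location"}
    CACHE_HEADERS = {"cache-control", "expires", "etag", "last-modified", "pragma"}
    BODY_HEADERS = {"content-length", "content-encoding", "transfer-encoding"}


    def rewrite_headers(headers, body_rewritten=True):
        """Takes and returns a list of (name, value) pairs."""
        out = []
        for name, value in headers:
            key = name.lower()
            if key in SECURITY_HEADERS or key in CACHE_HEADERS:
                continue  # dropped; cache headers are replaced below
            if body_rewritten and key in BODY_HEADERS:
                continue  # no longer accurate once the body has been changed
            if key in URL_HEADERS:
                value = PROXY_PREFIX + value
            out.append((name, value))
        # Default policy in this sketch: disable caching.
        out.append(("Cache-Control", "no-cache, no-store"))
        return out


    print(rewrite_headers([("Location", "http://example.com/next")]))
    ```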
  • Location changes

  • js_rewrite_location rule specifies how JS is processed. Modes are "location", "urls" and "all". If "location", references to any identifier named "location" are rewritten. If "urls", any absolute URLs in strings are rewritten. If "all", both are rewritten.

    pywb/rules.yaml defines a set of rules for various major websites

  • References to 'location' in scripts are replaced with references to WB_wombat_location which can then intercept the location change and set the real location.href to a proxied URL

    Matching is done with a regex, so it also ends up replacing the word "location" or "domain" when it appears inside strings.

    r'(?:(?<=["';])https?:|(?<=["']))\{0,4}/\{0,4}/[A-Za-z0-9:_@%.\-]+/'

  • The JS location hack tries to work around a browser security feature that makes 'document.location' unforgeable. See https://lists.w3.org/Archives/Public/public-script-coord/2012JulSep/0144.html for history.
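
The three js_rewrite_location modes, and the false positives that come from regex matching, can be sketched as follows. The patterns, the rewrite_js() function and PROXY_PREFIX are simplified assumptions for illustration rather than pywb's actual rules.

```python
import re

PROXY_PREFIX = "https://proxy.example/"  # hypothetical proxy prefix

# Simplified: any whole word "location" or "domain", wherever it appears.
LOCATION_RX = re.compile(r"\b(location|domain)\b")
# Simplified: an absolute URL that starts right after a quote character.
URL_RX = re.compile(r"""(?<=["'])(https?://[^"'\s]+)""")


def rewrite_js(source, mode="all"):
    if mode not in ("location", "urls", "all"):
        raise ValueError("unknown mode: %r" % mode)
    if mode in ("urls", "all"):
        source = URL_RX.sub(lambda m: PROXY_PREFIX + m.group(1), source)
    if mode in ("location", "all"):
        source = LOCATION_RX.sub(r"WB_wombat_\1", source)
    return source


print(rewrite_js('window.location = "/home";', "location"))
# The regex cannot tell identifiers from string contents, so this is rewritten too:
print(rewrite_js('alert("your location is unknown");', "location"))
```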

Client-side components

  • wb.js: Manages the banner which is inserted into the page by the proxy. The "banner" can also include elements such as the sidebar.

  • vidrw.js: Rewrites video elements from major video providers (eg. YouTube)

  • WombatJS: Monkey-patches JavaScript APIs on the page in order to handle AJAX requests and rewrite attributes of dynamically created tags.

    Parts include:

    • Functions for rewriting URLs on the client side
    • WombatLocation - A fake implementation of the Location object which rewrites the URL in assignments to the href property etc.
    • A window.history override which updates the root URL for the page when the location is changed via the history API
    • An XMLHttpRequest override which overrides the open() method to rewrite the URL
    • A baseURI override which makes the base URI appear to JS code as the original URL rather than the proxy URL (eg. document.baseURI on https://via.hypothes.is/$URL returns $URL)
    • Overrides for Element.(setAttribute|getAttribute). The setAttribute() override rewrites URLs passed to 'src', 'href' attrs and rewrites CSS in inline styles. The getAttribute() override does the opposite rewriting (ie. proxied URL -> original URL)
    • Overrides createElement() to modify the 'action' attr for form elements.
    • Overrides the Date constructor so that when revisiting an archived page, it behaves as if the user was visiting at the time they originally visited the page.
    • Breaks WebWorkers by setting window.Worker to undefined
    • Overrides various properties which can be set to HTML strings (eg. innerHTML) to rewrite content in the dynamically added HTML
    • Overrides window.postMessage() to rewrite the origin argument (otherwise the message would fail to be delivered). Listeners for the 'message' event are also wrapped to perform the opposite rewriting (transforming the origin from proxied URL -> original URL)
    • Overrides window.open() to open a rewritten URL
    • Overrides document.cookie to perform client-side rewriting similar to what cookie_rewriter.py does on the server. (cf. the Genius proxy which overrides document.cookie to make assignments no-ops)
    • Installs WombatJS into client <iframe>s
    • Overrides the 'action' attribute of all forms on the page
    • Overrides window.navigator.registerProtocolHandler() to rewrite the handler URL
    • There are big chunks of commented-out code which set up MutationObserver, presumably for dynamic HTML rewriting
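
Many of the overrides above share one idea: rewrite a URL on the way in and undo the rewriting on the way out, so page scripts only ever see original URLs. WombatJS itself is JavaScript; the sketch below only models that round trip in Python, using the https://via.hypothes.is/ prefix from the baseURI example. The function and class names are hypothetical.

```python
PROXY_PREFIX = "https://via.hypothes.is/"  # prefix taken from the baseURI example

URL_ATTRS = ("src", "href")


def rewrite_url(url):
    """Original URL -> proxied URL (absolute http/https URLs only in this sketch)."""
    if url.startswith(PROXY_PREFIX):
        return url
    if url.startswith(("http://", "https://")):
        return PROXY_PREFIX + url
    return url


def extract_orig(url):
    """Proxied URL -> original URL."""
    if url.startswith(PROXY_PREFIX):
        return url[len(PROXY_PREFIX):]
    return url


class FakeElement:
    """Stand-in for a DOM element with overridden setAttribute/getAttribute."""

    def __init__(self):
        self._attrs = {}

    def set_attribute(self, name, value):
        if name in URL_ATTRS:
            value = rewrite_url(value)  # stored value points at the proxy
        self._attrs[name] = value

    def get_attribute(self, name):
        value = self._attrs.get(name)
        if value is not None and name in URL_ATTRS:
            value = extract_orig(value)  # scripts read back the original URL
        return value


el = FakeElement()
el.set_attribute("src", "http://example.com/a.png")
print(el._attrs["src"], el.get_attribute("src"))
```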
@judell

judell commented Nov 25, 2015

Thanks Robert! Great to have all this summarized.

@seanh

seanh commented Nov 26, 2015

Thanks @robertknight, very interesting!
