brianhempel/DEFLATE_BREACH.md

## DEFLATE_BREACH.md

      
    Raw
  

              DEFLATE_BREACH.md
            
          
    DEFLATE BREACH

DEFLATE BREACH is a proposed DEFLATE compressor designed to mitigate BREACH attacks with only a moderate loss in compression ratio.
Contents


Introduction
LZ77 Matching
Huffman Coding
Worst Case Expansion
Compression Ratio Loss

Introduction

The BREACH attack demonstrates how HTTP response compression may leak secrets in the response, even when the response is encrypted.
An attacker may discover secrets on an encrypted page if:

The attacker is able to issue requests as a target user, such as in an iframe on a compromised website.
The attacker is able to control text somewhere in the response, e.g. the attacker might request /search?q=owls knowing the response will contain You searched for "owls".
The attacker is able to observe the length of responses, even if hidden by encryption.
The responses are compressed.

HTTP compression uses DEFLATE. DEFLATE itself is a particular implementation of LZ77 matching followed by Huffman coding. See Wikipedia for a good introduction to LZ77 and Huffman coding, and RFC1951 for the specifics of DEFLATE.
The attacker issues requests, varying the input until the input matches some secret elsewhere on the page. When the input matches part of the secret, DEFLATE takes advantage of the redundancy and decreases the page size. When the attacker observers the page size decrease, the attacker knows they made a correct guess. Thus the attacker can guess the secret, character by character.
A complete description of the BREACH attack may be found at http://breachattack.com/.
The worst case scenario is when a known, nearly static page varies only with the attacker's injection and the target secret. In this case, even if the compressor did no LZ77 matching, the attacker might be able to vary their chosen plaintext to gain information about the character frequencies in the document, as the attacker's injection of characters can change the Huffman tree used to encode the document. If the page is essentially static, so that only the secret is unknown to the attacker, the attacker may be able to determine the character frequencies of the secret. Lower entropy secrets, such as names and email addresses and probably certain bank account numbers, are at risk.
Since compression is the problem, the surest way to mitigate the attack is to disable compression. However, HTTP compression typically offers space savings of 60% or more, which translates into faster response times for users. Other techniques to avoid attack, such as encoding secrets with one-time pads that vary on every request or compressing user input and secrets in separate compression contexts, require the application developer to rigorously determine where their secrets are. A technique that requires constant vigilance among all developers on a project is not secure by design.
Could a compressor be produced that, with high probability, is unaffected by secrets while still providing a significant compression ratio?
DEFLATE BREACH is a DEFLATE compressor for web content that ignores the unique, human-oriented information in individual files, while still taking advantage of the large redundancy in HTML, CSS, and Javascript due to the common strings inherent to each language.
DEFLATE BREACH produces DEFLATE compatible compressed data, but uses the extra rules described below to avoid leaking secret information.
LZ77 Matching

As the compressor looks for LZ77 matches, secrets already encoded in the datastream should never be part of a match candidate.
Matches are thus only drawn from:

Bytes numbers unlikely to compose the majority of a secret (such as, say ASCII symbols)
A whitelist of strings common to the data format

The allowed byte numbers and string whitelist depend on the data format being compressed. If the user doesn't specify the format, the format is auto-detected based on the first 100 bytes of uncompressed input after any C-style comments and whitespace.
The supported formats are:

HTML
CSS
Javascript
Undetected

HTML is auto-detected by a doctype, <html, <head, or <body tag.
CSS is auto-detected by discovering a CSS keyword.
Javascript is auto-detected by by a Javascript keyword.
Within HTML, a CSS context may be discovered after the first <head> opening tag and before the first </head> closing tag. The CSS context begins after a <style> opening tag if there is a </style> closing tag before the </head> closing tag. The CSS context continues until the </style> closing tag. Multiple CSS contexts may be present in the head.
The style="" HTML attribute does not trigger a switch to a CSS context.
Javascript contexts may appear anywhere in the HTML. A Javascript context begins after an <script> opening tag only if there is a terminating </script> tag before the next occurrence of <script> or before the end of the document. Thus, an attacker with control of one point in the document cannot force the rest of the document into Javascript mode. An attacker might want to switch to Javascript mode to allow more symbols in matches. However, it may be argued that a website that allows its users to insert an unescaped <script> tag may have bigger problems than BREACH to worry about.
The onclick="" etc. HTML attributes do not trigger a switch to a Javascript context.
Matches are only allowed in the context of the same format. Javascript with Javascript, CSS with CSS, HTML with HTML. HTML after a block of Javascript or CSS is not limited to matching only on the near side of the interruption. HTML may match HTML on the other side of the Javascript or CSS. Similarly, a CSS or Javascript block may match inside a previous block of the same format.
In HTML format, matches are only drawn from:


These bytes unlikely to compose the majority of a secret: whitespace, the characters: <>/="'
HTML tag and attribute names longer than 1 character
The common strings: images http https www. :// .com .net .org <a </a> <p </p>
Any global, user-provided safe strings

In CSS format, matches are only drawn from:


These bytes unlikely to compose the majority of a secret: whitespace, the characters: .#:();{},
CSS attribute names, unit names, value names, and function names longer than 1 character
The common strings: 100% 0px 1px 0em 0 (see note about whitelist string matching below)
Any global, user-provided safe strings

In Javascript format, matches are only drawn from:


These bytes unlikely to compose the majority of a secret: whitespace, the characters: (){}="'[];,.
Javascript keywords longer than 1 character
Javascript operators longer than 1 character
Javascript standard library class names and function names longer than 1 character
DOM standard element names and functions longer than 1 character
3-character sequences consisting of a lone lowercase a, b, e, i, n, t, or x flanked on both sides by different, non-quote, non-whitespace allowed characters. These are very common in minified JS.
Any global, user-provided safe strings

In Undetected mode, matches are only drawn from:


Any global, user-provided safe strings (strings may be only a single character, effectively allowing that character in all matches if the character is non-alphanumeric; see note below)

When the format is undetected, matching does not occur if the user does not provide a list of safe strings.
Further rules for whitelisted strings

For all formats, whitelisted strings that begin with an alphanumeric character may only participate in a match if the character immediately previous to the string is non-alphanumeric. Similarly, whitelisted strings that end in an alphanumeric character may only participate in a match if the character immediately following the string is non-alphanumeric. These previous or following characters are not required to participate in the match. The rule applies only to the match source, not the incoming data to be encoded.  For example, in a Javascript context the word "functional" might be encoded by matching a previous occurrence of the Javascript keyword "function", but the word "functional" itself could not participate in any future matches: a second occurrence of the word "functional" could not match the first "functional", not even the "function" part of the word, it must resort to matching the earlier "function" keyword.
Though we could tighten the rule so that "functional" could not even be encoded by matching the keyword "function", this would not significantly hinder an attacker. The attacker is assumed to be able to use their chosen plaintext to omnisciently discover all potential match sources in the document. We only have to ensure that secrets are never a match source. It does not matter if a valid match source is used to create output that is not a valid match source. When this happens, the pool of match sources that the attacker may query does not increase.
For the same reason, an encoded match may consist of only part of a whitelisted string ("unction" could be coded from a previous "function") as long as the entire whitelisted string, with any required non-alphanumeric characters before or after, appears at the match source.
Huffman Coding

Huffman coding is modified to leak as little information as possible about the character frequencies of any secrets.
Each entire document is Huffman coded using only one of three probability models, as determined by the user:

A non-dynamic, default frequency table provided by DEFLATE BREACH, determined by the main file type (a separate table is provided for each of the three main formats: HTML, JS, and CSS).
Fixed Length Huffman codes (as already defined by zlib). If there was no matching, as when the format is undetected and no safe strings are provided, then some size expansion will occur with fixed length codes.
A user-provided frequency table.

The Huffman coding strategy must not change on every request. Otherwise, an attacker might be able to determine a secret's character frequencies by figuring out which injected characters tip the compressor to use a different strategy.
For the same reason, a user providing a frequency table must not recalculate the frequency table on every request. Instead, the user might:

Calculate the frequency table based only on the first n requests for a resource after server startup, and use that table only after all n requests; n=8 may be sufficient.
Choose to update the frequency table with information from a request only when a random number is less than 1/4^n, where n is the number of previous times the table has been updated since server startup. (By n=10, the table will only be updated once every million requests.)

As a convenience to application implementations, users may query DEFLATE BREACH for the actual symbol frequencies after compression, including the frequencies of the match length and match distance symbols.
DEFLATE allows symbols of 0 frequency to not be given a Huffman code. However, because any given frequency table in DEFLATE BREACH is only a guess at the actual frequencies, DEFLATE BREACH still assigns a (long) Huffman code to "0 frequency" symbols so the compressor does not fail if they do happen to occur.
Worst Case Expansion

Unlike optimal DEFLATE implementations, all information is stored "compressed", even if a size expansion occurs. Thus, DEFLATE BREACH has a greater size expansion in the worst case than optimal DEFLATE. The worst case depends on the longest code in the Huffman tree.
The user, of course, may choose not to use DEFLATE BREACH for certain resources. Again, however, this decision must be made infrequently enough that an attacker cannot effectively control the decision to gain information about the page content. See "Huffman Coding" above for some possible strategies.
Compression Ratio Loss

An initial prototype, with slightly stricter LZ77 matching rules than described above but no change to Huffman coding (meaning the file may be coded in multiple blocks with different Huffman trees—only one tree is allowed in DEFLATE BREACH) yields the following results on an HTML file:
File: The HTML of breachattack.com, with the lines between each <style> and </style> removed.

Uncompressed size:         13383 bytes
Huffman only, no matching: 8556 bytes (36.1% space savings)
DEFLATE BREACH prototype:  6823 bytes (49.0% space savings)
gzip -9:                   4723 bytes (64.7% space savings)