Skip to content

Instantly share code, notes, and snippets.

@vstakhov
Created December 15, 2016 14:44
Show Gist options
  • Save vstakhov/937f253d5935ee4158688932589b1dcc to your computer and use it in GitHub Desktop.
Save vstakhov/937f253d5935ee4158688932589b1dcc to your computer and use it in GitHub Desktop.

New mime parser algorithm for Rspamd

  1. If we have any complicated (multipart or message) part in the top, then apply the following algorithm. Otherwise goto step 10
  2. Use multipattern scan (hyperscan/aho-corasic) to find all occurences of patterns \r-- and \n--
  3. For each occurence we check sanity:
  • We find the length of substring where there are no \r and \n
  • We check the ending of this substring
  • If this substring is empty, we skip it
  • If this substring ends with -- we assume it closing
  1. For normal boundaries calculate H(boundary), where H is a hash function described below
  2. For closing boundaries calculate H(boundary) AND H(closed_boundary)
  3. When all boundaries are found, perform parsing of message using array of the following structures:
  • Boundary offset
  • Bondary hash (+ closed hash)
  • End of boundary (potential part start)
  • Flags (closing boundary)
  1. For each multipart we apply filter function which selects the boundaries that are related to this multipart
  2. filter function uses hashes to select the required boundaries and ensures that all offsets are enclosed within multipart range
  3. Apply this algorithm for all multiparts (recursing to the nested subparts if needed)
  4. For other parts just parse them normally (using CTE detection + heuristic)

Algorithm advantages

  • Scans text only once when searching for boundaries
  • We do not compare boundaries which might be long strings but use hashes
  • It is easy to check overflows and corner cases
  • It is easy to work with broken mime structures (as no bactracking is needed)
  • Algorithm is very simple (less than 1k LoC)

Algorithm disadvantages

  • Cannot work in streaming mode (BAD)
  • Hash collisions (but see below)

Hash function concerns

We apply siphash with randomly generated key. This key is regenerated after 10k messages are parsed. Hence, the collision probability (both malicious and non-malicious) is negligible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment