vstakhov/mime_parser.md

## mime_parser.md

      
    Raw
  

              mime_parser.md
            
          
    New mime parser algorithm for Rspamd


If we have any complicated (multipart or message) part in the top, then apply the following algorithm. Otherwise goto step 10
Use multipattern scan (hyperscan/aho-corasic) to find all occurences of patterns \r-- and \n--
For each occurence we check sanity:


We find the length of substring where there are no \r and \n
We check the ending of this substring
If this substring is empty, we skip it
If this substring ends with -- we assume it closing


For normal boundaries calculate H(boundary), where H is a hash function described below
For closing boundaries calculate H(boundary) AND H(closed_boundary)
When all boundaries are found, perform parsing of message using array of the following structures:


Boundary offset
Bondary hash (+ closed hash)
End of boundary (potential part start)
Flags (closing boundary)


For each multipart we apply filter function which selects the boundaries that are related to this multipart
filter function uses hashes to select the required boundaries and ensures that all offsets are enclosed within multipart range
Apply this algorithm for all multiparts (recursing to the nested subparts if needed)
For other parts just parse them normally (using CTE detection + heuristic)

Algorithm advantages


Scans text only once when searching for boundaries
We do not compare boundaries which might be long strings but use hashes
It is easy to check overflows and corner cases
It is easy to work with broken mime structures (as no bactracking is needed)
Algorithm is very simple (less than 1k LoC)

Algorithm disadvantages


Cannot work in streaming mode (BAD)
Hash collisions (but see below)

Hash function concerns

We apply siphash with randomly generated key. This key is regenerated after 10k messages are parsed. Hence, the collision probability (both malicious and non-malicious) is negligible.