stvemillertime/gist:18e25231ac4ba26e98dc85414f0f1517

## gistfile1.md

      
This started with a tweet from Steve Miller (https://twitter.com/stvemillertime/status/1508441489923313664) in which he asked which is better for performance: 1 rule with 10k strings or 10k rules with 1 string each? Based on my understanding of YARA I guessed it wouldn't matter for search time and that any difference in bytecode evaluation would be in the noise. Effectively, I guessed you would not be able to tell the difference between the two.
Costin was the first to provide actual results, and he reported 35 seconds versus 31 seconds for the two (https://twitter.com/craiu/status/1508445059129163783). That didn't make much sense to me, so I asked for his rules so I could test them. He provided me with two rules files (10k.yara and 10kv2.yara) and a text file with a bunch of strings in it.
This is my attempt to replicate his findings and to document why he was getting the warning he saw. Because I wanted the run to take a bit of time, I did not use his text file with all the strings (it was a single 410KB file, not nearly enough data to be relevant) and created my own data instead. In the end I was not able to replicate his findings, and my results line up with what I was expecting in terms of relative performance.
First, my test data:
Make 5000 files of the same size, each filled with random data. Each file is about 2.4MB (5000 blocks of 512 bytes, which is dd's default block size):
wxs@mbp wxs % cd tmp && for f in $(jot 5000 1 5000); do dd if=/dev/random of=$f count=5000; done

This resulted in 12GB worth of random data spread evenly over 5000 files.
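For anyone repeating this on a system without jot, the same step can be sketched as a small shell function. This is only an illustration: make_random_files is a hypothetical helper name, it reads /dev/urandom rather than /dev/random so it does not block on older Linux kernels, and it spells out the 512-byte block size that dd uses by default.

```shell
# Sketch: create COUNT files, each BLOCKS * 512 bytes of random data, in DIR.
# make_random_files is a hypothetical name, not something from the original run.
make_random_files() {
    dir=$1; count=$2; blocks=$3
    mkdir -p "$dir" || return 1
    i=1
    while [ "$i" -le "$count" ]; do
        # bs=512 makes dd's default block size explicit
        dd if=/dev/urandom of="$dir/$i" bs=512 count="$blocks" 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# The run described above would be: make_random_files tmp 5000 5000
# Total bytes for that run: 5000 files * 5000 blocks * 512 bytes
echo $((5000 * 5000 * 512))
```

The arithmetic at the end works out to 12,800,000,000 bytes, which is where the roughly 12GB figure comes from.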
Now, the test YARA rules I was using, provided by Costin:
wxs@mbp wxs % head 10k.yara
rule stringstest {

strings:

        $a0="sCZkvPP2drTBaxGlqVzF33qt4qbP3eQCwaPzO9ux" ascii wide fullword
        $a1="JUIxp2wTIZh6rHssxZYUC4T9RS68JCZ6tVQ6J8O1" ascii wide fullword
        $a2="7EWxsKmq6roK14dGUIPUGquAVbkBamWP6BVB5lT7" ascii wide fullword
        $a3="evnGdo0P7zDXfymAGWlotJbJ63NDJpnMGS5i0B5j" ascii wide fullword
        $a4="vu6BBhTNKPbfnApOSwW550T5Z4OeFmCFUMjXGqb0" ascii wide fullword
        $a5="st42c1G11rxvEyZ4GOa8rs9IzbSibRGU6W5lvpMU" ascii wide fullword
wxs@mbp wxs % tail 10k.yara
        $a9996="fGYdzxy4TH6GRDhxYRZLPKjd3LrsJTEWtlq9FVfT" ascii wide fullword
        $a9997="0GBvrjBPN9YDmjPTiu6YSZnIXgdvDKcfCEzfHOSY" ascii wide fullword
        $a9998="QGcVtdjnyoYX0OPGwDLT3ADGNIykDHFFEGvmG9Uq" ascii wide fullword
        $a9999="kbIKb92DNcSqd1B9PPFOOEkBrZLAebOENGX2cm1E" ascii wide fullword


condition:

        (any of ($a*))
}
wxs@mbp wxs %
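If you don't have Costin's file, a rule file of the same shape can be approximated as below. This is a guess at the layout based only on the head and tail output above: gen_single_rule is a hypothetical helper, and the strings it produces are fresh random 40-character alphanumerics, not Costin's originals.

```shell
# Sketch: print one rule containing N random 40-character strings, each with
# the "ascii wide fullword" modifiers. gen_single_rule is a hypothetical name.
gen_single_rule() {
    n=$1
    printf 'rule stringstest {\n\nstrings:\n\n'
    # tr keeps only alphanumerics, fold cuts the stream into 40-char lines,
    # head stops after N of them, awk numbers them from $a0.
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | fold -w 40 | head -n "$n" |
        awk '{ printf "        $a%d=\"%s\" ascii wide fullword\n", NR - 1, $0 }'
    printf '\n\ncondition:\n\n        (any of ($a*))\n}\n'
}

# Example (the output file name is arbitrary): gen_single_rule 10000 > my10k.yara
```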

So we have a single rule with 10k strings. Each string is using the "ascii wide fullword" modifiers. The "ascii" and "wide" modifiers each generate their own atoms, so we have 20k atoms being generated. This explains why we get this warning:
wxs@mbp wxs % yara 10k.yara /bin/ls
warning: rule "stringstest" in 10k.yara(10010): rule is slowing down scanning
wxs@mbp wxs %

This warning appears because there is a 12k limit PER RULE on the number of atoms. Once you go past 12k atoms you get this warning. The rule will still compile and run properly, and all your strings are still fine (I verified this with yara -s), but the warning is the compiler's way of telling you that you might want to rethink this rule. :)
As an aside, to verify this I commented out from $a6000 to the end of the strings ($a9999) and the warning does not appear. If I comment out from $a6001 to the end, the warning appears. This tells me that each string is generating 2 atoms and we cross the limit at $a6000 (which is actually string 6001 because they are zero indexed).
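The arithmetic behind that bisection can be written out as a quick check. It rests on two assumptions taken from the observations above rather than from the YARA source: each string yields two atoms (one for ascii, one for wide), and the warning fires once a rule's atom count goes above 12000.

```shell
# Assumption: 2 atoms per string (ascii + wide).
atoms_for() { echo $(( $1 * 2 )); }

# Assumption: the warning appears when a rule has more than 12000 atoms.
warns() { [ "$(atoms_for "$1")" -gt 12000 ]; }

for n in 6000 6001 10000; do
    if warns "$n"; then state="warning"; else state="no warning"; fi
    echo "$n strings -> $(atoms_for "$n") atoms: $state"
done
```

With 6000 strings the rule sits at exactly 12000 atoms and stays quiet, while 6001 strings gives 12002 atoms and trips the warning, which matches what the commenting-out experiment showed.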
So with the understanding about the warning let's test it with some data.
wxs@mbp wxs % for i in 1 2 3; do /usr/bin/time ~/src/yara/yara -r 10k.yara tmp; done
warning: rule "stringstest" in 10k.yara(10010): rule is slowing down scanning
        8.40 real        55.55 user        12.06 sys
warning: rule "stringstest" in 10k.yara(10010): rule is slowing down scanning
        8.79 real        54.67 user        12.43 sys
warning: rule "stringstest" in 10k.yara(10010): rule is slowing down scanning
        9.12 real        53.13 user        11.62 sys
wxs@mbp wxs %

So that's about 8 or 9 seconds running with the rules above, with no strings commented out and all the modifiers Costin originally used.
My understanding is that adding more modifiers should not slow down scanning (unless you do something crazy like have 10k strings each generating 2 atoms and hit the limit =b), so let's test that. To do this I removed the "ascii wide fullword" modifiers from each of the strings. The file now looks like this:
wxs@mbp wxs % head 10k.yara
rule stringstest {

strings:

        $a0="sCZkvPP2drTBaxGlqVzF33qt4qbP3eQCwaPzO9ux"
        $a1="JUIxp2wTIZh6rHssxZYUC4T9RS68JCZ6tVQ6J8O1"
        $a2="7EWxsKmq6roK14dGUIPUGquAVbkBamWP6BVB5lT7"
        $a3="evnGdo0P7zDXfymAGWlotJbJ63NDJpnMGS5i0B5j"
        $a4="vu6BBhTNKPbfnApOSwW550T5Z4OeFmCFUMjXGqb0"
        $a5="st42c1G11rxvEyZ4GOa8rs9IzbSibRGU6W5lvpMU"
wxs@mbp wxs %
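One way to strip the modifiers is a one-line sed filter. This is a sketch, not necessarily how the file above was edited: strip_modifiers is a hypothetical name, and the pattern assumes the modifiers appear exactly as " ascii wide fullword" at the end of each string line.

```shell
# Sketch: remove a trailing " ascii wide fullword" from every line on stdin.
# Lines without that suffix pass through unchanged.
strip_modifiers() {
    sed 's/ ascii wide fullword$//'
}

# Example (the output file name is arbitrary):
#   strip_modifiers < 10k.yara > 10k-nomod.yara
```

Writing to a separate file keeps the original rules around for re-running the first test.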

Here's the output of those rules:
wxs@mbp wxs % for i in 1 2 3; do /usr/bin/time ~/src/yara/yara -r 10k.yara tmp; done
        8.50 real        54.17 user        11.56 sys
        8.29 real        54.16 user        12.62 sys
        8.17 real        54.85 user        12.69 sys
wxs@mbp wxs %

So about 8 seconds to run these strings with no modifiers. This lines up with what I would expect, because adding more strings is not a big deal when it comes to the scanning algorithm.
Costin also sent a file with 10k rules, each with their own string. They look like this:
wxs@mbp wxs % head 10kv2.yara
rule stringstest0 {

strings:
        $a0="sCZkvPP2drTBaxGlqVzF33qt4qbP3eQCwaPzO9ux" ascii wide fullword


condition:
        (any of ($a*))
}
rule stringstest1 {
wxs@mbp wxs %
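To compare the two layouts using exactly the same strings, a file like this can be derived from the single-rule file. The awk filter below is a sketch under the assumption that every string line looks like the ones shown in 10k.yara (leading whitespace, then $aN=...); split_rules is a hypothetical name.

```shell
# Sketch: read a single-rule file on stdin and print one rule per string line,
# copying each string definition (modifiers included) unchanged.
split_rules() {
    awk '
        /^[ \t]*\$a[0-9]+=/ {
            printf "rule stringstest%d {\n\nstrings:\n%s\n\n\ncondition:\n        (any of ($a*))\n}\n", n++, $0
        }
    '
}

# Example (the output file name is arbitrary):
#   split_rules < 10k.yara > my10kv2.yara
```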

Here's what the output of those rules looks like:
wxs@mbp wxs % for i in 1 2 3; do /usr/bin/time ~/src/yara/yara -r 10kv2.yara tmp; done
        8.48 real        57.05 user        11.64 sys
        8.55 real        57.12 user        12.60 sys
        8.63 real        56.78 user        12.45 sys
wxs@mbp wxs %

Again, roughly the same amount of time as a single rule with 10k strings and the three modifiers.
For an apples-to-apples comparison I'll remove the modifiers from 10kv2.yara and run it. This is the output from that:
wxs@mbp wxs % for i in 1 2 3; do /usr/bin/time ~/src/yara/yara -r 10kv2.yara tmp; done
        8.27 real        56.53 user        11.55 sys
        8.36 real        56.35 user        12.41 sys
        8.46 real        56.03 user        12.31 sys
wxs@mbp wxs %

Same time for these rules too.
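To put the four sets of runs side by side, here is a small sketch that averages the "real" times reported above. The mean helper is a hypothetical name, and the numbers are copied straight from the /usr/bin/time output in this write-up.

```shell
# Sketch: print the arithmetic mean of the arguments to two decimal places.
mean() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s / NR }'
}

echo "1 rule, with modifiers:     $(mean 8.40 8.79 9.12)"
echo "1 rule, no modifiers:       $(mean 8.50 8.29 8.17)"
echo "10k rules, with modifiers:  $(mean 8.48 8.55 8.63)"
echo "10k rules, no modifiers:    $(mean 8.27 8.36 8.46)"
```

The means come out to 8.77, 8.32, 8.55 and 8.36 seconds. The first set alone ranges from 8.40 to 9.12 seconds, so the gaps between configurations are no larger than the run-to-run variation.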
Conclusion: there is no difference (in my testing) between 1 rule with 10k strings and 10k rules with 1 string each. However, I cannot replicate or explain the difference Costin saw between the two tests in his environment.
To be perfectly clear: these tests are not about comparing my times to Costin's, they are about comparing the relative times of 1 rule with 10k strings against 10k rules with 1 string each.
YARA actually has reasonable profiling information built in, which may be worth exploring here to see if it sheds some light on why Costin is getting different results that I cannot seem to replicate.