@notareverser · Created February 25, 2022
Brief treatise on the tradeoffs between YARA rules made from strings, code, and data
Today for #100DaysOfYARA I want to further explore one of my favorite topics:
"How to reliably detect libraries", or how to identify that a particular program has linked or otherwise included a particular library.
Detecting libraries (especially ones written in C) poses unique challenges compared to detecting malware, including:
- libraries tend to be platform/architecture nonspecific
- compilerisms overwhelm otherwise decent signal
- copy/pasta and groupthink across libraries
I think there are at least three different approaches to detecting the presence of a particular library (and really any program), as applied to YARA rule creation:
- code
- strings
- other data
Each one has upsides and downsides, and can be combined with the others to greater or lesser effect. Let's talk about some of these upsides/downsides.
Using code has the advantage that, in many cases, the presence of the compiled code can be both necessary and sufficient to identify the library. However, because libraries can be used in any context, and because many malware analysts have little control over their incoming pipeline, it is quite difficult to maintain a set of rules that captures all the permutations of architecture, compiler settings, composition, and modification/forking you are likely to see. That said, code-based rules remain my preferred way of expressing a YARA rule, as the benefit tends to outweigh the cost.
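To make that maintenance burden concrete, here is a minimal sketch (all bytes below are invented for illustration, not taken from any real library): the same source function compiles to different bytes on each architecture, so a code-based rule accretes one alternative per target you care about.

rule example_multiarch_fn
{
    strings:
        // hypothetical x86 encoding of a function fragment
        $x86 = { 55 8B EC 8B 45 08 85 C0 74 ?? }
        // hypothetical x86-64 encoding of the same logic
        $x64 = { 48 85 FF 74 ?? 48 8B 07 }
    condition:
        any of them
}

Multiply those alternatives by compiler versions and optimization levels and the upkeep cost becomes clear.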
Strings-based rules have several advantages, which I've touched on briefly here
https://twitter.com/notareverser/status/1493936916416405505
However, strings-based rules to detect libraries have some decided disadvantages. The vast majority of strings on which to signal exist for debugging or other informational purposes, which means they can be completely eliminated with no effect on the behavior of the library. Further, the presence of strings does not in any way inform the context behind why the strings are present, so the cost of validating the rule hit is transferred to an expensive resource: your analysts' eyeballs.
If you doubt this, pick any strings-based YARA rule you have, go find the files that hit, replace the string value in the program with an equally-sized string (use sed, naturally), and run the files. If you can detect a notable difference, congrats, you might have a decent strings-based rule.
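As a minimal sketch of the strings-based approach applied to zlib itself, built from the author credits in zlib's embedded copyright notices (the same strings I use below to obtain exemplars):

rule zlib_strings_sketch
{
    strings:
        // author credits from zlib's copyright notices; purely informational,
        // so a build can strip or alter them with no change in behavior
        $a = "Mark Adler"
        $b = "Jean-loup Gailly"
    condition:
        all of them
}

This rule fails the sed test above in the worst way: swap the names for any equally-sized strings and the library works exactly as before. It will also hit source archives, documentation, and anything else that quotes the credits.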
Finally we have other data. I would include within this category things like sets of function arguments, variables with context-specific values and usages (e.g. algorithm initialization), run-time type information (RTTI), or any other explicit or implicit value that implies the presence of the library. These can be wonderful to use as signal, because changing them generally has severe effect on the behavior of the library. However, finding and using such data requires a proficient developer or analyst, and thus the cost can be prohibitively high for many shops.
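As one concrete instance of such data: zlib ships a precomputed CRC-32 table, and the first entries of the standard reflected table (polynomial 0xEDB88320) are 0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA. A sketch that matches them little-endian:

rule crc32_table_sketch
{
    strings:
        // first four entries of the standard reflected CRC-32 table
        // (polynomial 0xEDB88320), little-endian; corrupting these breaks
        // checksumming, unlike removing an informational string
        $crc = { 00 00 00 00 96 30 07 77 2C 61 0E EE BA 51 09 99 }
    condition:
        $crc
}

Two caveats: the identical table appears in every CRC-32 implementation that borrows it, so this signals "CRC-32 present" rather than "zlib present", and zlib builds configured to compute the table at run time (e.g. with DYNAMIC_CRC_TABLE) will not contain it at all.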
I thought it would be fun to explore this process using our old friend zlib
https://github.com/madler/zlib
I spent many hours testing strings, code, and data, in several combinations, against hundreds of files across four architectures and several formats. My most recent attempt is probably my best one, though by no means perfect.
rule zlib
{
    strings:
        // distfix: zlib's fixed-mode distance decoding table (inffixed.h),
        // stored as 4-byte {op, bits, val} entries
        $distfix = {10 05 01 00 17 05 01 01 13 05 11 00 1b 05 01 10 11 05 05 00 19 05 01 04 15 05 41 00 1d 05 01 40 10 05 03 00 18 05 01 02 14 05 21 00 1c 05 01 20 12 05 09 00 1a 05 01 08 16 05 81 00 40 05 00 00 10 05 02 00 17 05 81 01 13 05 19 00 1b 05 01 18 11 05 07 00 19 05 01 06 15 05 61 00 1d 05 01 60 10 05 04 00 18 05 01 03 14 05 31 00 1c 05 01 30 12 05 0d 00 1a 05 01 0c 16 05 c1 00 40 05 00 00}
        // lenfix_prefix: the opening bytes of zlib's fixed-mode literal/length
        // decoding table (also inffixed.h)
        $lenfix_prefix = {60 07 00 00 00 08 50 00 00 08 10 00 14 08 73 00 12 07 1f 00 00 08 70 00 00 08 30 00 00 09 c0 00 10 07 0a 00 00 08 60 00 00 08 20 00 00 09 a0 00 00 08 00 00 00 08 80 00 00 08 40 00 00 09 e0 00 10 07 06 00 00 08 58 00 00 08 18 00 00 09 90 00 13 07 3b 00 00 08 78 00 00 08 38 00 00 09 d0 00 11 07 11 00 00 08 68 00 00 08 28 00 00 09 b0 00 00 08 08 00 00 08 88 00 00 08 48 00 00 09 f0 00 10 07 04 00 00 08 54 00 00 08 14 00 15 08 e3 00 13 07 2b 00 00 08 74 00 }
    condition:
        all of them
}
I wanted to document the steps that I took, but unfortunately doing this properly would require more time than I'm willing to put toward the project. I will give you the highlights:
- obtain exemplars using "Mark Adler" and "Jean-loup Gailly" (for fun, do a frequency analysis of these terms on your corpus; you may be surprised)
- segregate by architecture/format
- automated code comparison using YARA signatures derived from each function in each file, followed by statistical analysis
- source-code inspection for global/static variables of significance
- dozens of iterations of encoding/searching/testing/validating
For reference, here is an alternate methodology that is cheaper and generally quite effective:
- find function in compiled library that seems unique
- automatically generate a YARA signature, nopping out constants, relative offsets, addresses, immediates, and other likely-to-change values (see the sketch after this list)
- add to ever-growing list of code-based rules with that label
- literally never touch it again unless you find a false positive
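A minimal sketch of that generation step, with invented bytes (this shows the shape of the output, not bytes from a real library): start from the raw function bytes and wildcard anything the compiler or linker is likely to change.

// hypothetical raw bytes: 8B 44 24 04 3D 10 27 00 00 75 0C E8 32 14 00 00
rule example_wildcarded_fn
{
    strings:
        // same bytes with the cmp immediate, jump displacement, and call
        // offset wildcarded, since those vary across builds
        $fn = { 8B 44 24 04 3D ?? ?? ?? ?? 75 ?? E8 ?? ?? ?? ?? }
    condition:
        $fn
}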
As I went through this, I wrote down some of the questions I would like for you to ponder in your quest to get better at creating YARA rules for libraries:
- What are the skills that I or others need to possess to have confidence in the quality of my rules?
- Without reverse engineering or source-code analysis, can I say anything concrete or defensible about the presence or absence of a library in a program?
- How would I go about obtaining a sufficient number of known exemplars to test the precision/recall of my rules?
- How much cost am I willing to impose to generate or maintain a rule?
If you want to share your own experiences developing/maintaining library rules, hit me up on Twitter @notareverser