@notareverser · Created February 25, 2022
Brief treatise on the tradeoffs between YARA rules made from strings, code, and data
Today for #100DaysOfYARA I want to further explore one of my favorite topics:
"How to reliably detect libraries", or how to identify that a particular program has linked or otherwise included a particular library.
Detecting libraries (especially ones written in C) poses unique challenges compared to detecting malware, including:
- libraries tend to be platform/architecture nonspecific
- compilerisms overwhelm otherwise decent signal
- copy/pasta and groupthink across libraries
I think there are at least three different approaches to detecting the presence of a particular library (and really any program), as applied to YARA rule creation:
- code
- strings
- other data
Each one has upsides and downsides, and can be combined with the others to greater or lesser effect. Let's talk about some of these upsides/downsides.
Using code has the advantage that, in many cases, the presence of the compiled code can be both necessary and sufficient to identify the library. However, because libraries can be used in any context, and because many malware analysts have little control over their incoming pipeline, it is quite difficult to maintain a set of rules that captures all the permutations of architecture, compiler settings, composition, and modification/forking you are likely to see. That said, code-based rules remain my preferred way of expressing a YARA rule, as the benefit tends to outweigh the cost.
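To make that maintenance burden concrete, here is a minimal sketch (all bytes below are invented for illustration, not taken from any real library): the same source function compiles to different bytes on each architecture, so a code-based rule accretes one alternative per target you care about.

rule example_multiarch_fn
{
    strings:
        // hypothetical x86 encoding of a function fragment
        $x86 = { 55 8B EC 8B 45 08 85 C0 74 ?? }
        // hypothetical x86-64 encoding of the same logic
        $x64 = { 48 85 FF 74 ?? 48 8B 07 }
    condition:
        any of them
}

Multiply those alternatives by compiler versions and optimization levels and the upkeep cost becomes clear.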
Strings-based rules have several advantages, which I've touched on briefly here
https://twitter.com/notareverser/status/1493936916416405505
However, strings-based rules to detect libraries have some decided disadvantages. The vast majority of strings on which to signal exist for debugging or other informational purposes, which means they can be completely eliminated with no effect on the behavior of the library. Further, the presence of strings does not in any way inform the context behind why the strings are present, so the cost of validating the rule hit is transferred to an expensive resource: your analysts' eyeballs.
If you doubt this, pick any strings-based YARA rule you have, go find the files that hit, replace the string value in the program with an equally-sized string (use sed, naturally), and run the files. If you can detect a notable difference, congrats, you might have a decent strings-based rule.
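As a minimal sketch of the strings-based approach applied to zlib itself, built from the author credits in zlib's embedded copyright notices (the same strings I use below to obtain exemplars):

rule zlib_strings_sketch
{
    strings:
        // author credits from zlib's copyright notices; purely informational,
        // so a build can strip or alter them with no change in behavior
        $a = "Mark Adler"
        $b = "Jean-loup Gailly"
    condition:
        all of them
}

This rule fails the sed test above in the worst way: swap the names for any equally-sized strings and the library works exactly as before. It will also hit source archives, documentation, and anything else that quotes the credits.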
Finally we have other data. I would include within this category things like sets of function arguments, variables with context-specific values and usages (e.g. algorithm initialization), run-time type information (RTTI), or any other explicit or implicit value that implies the presence of the library. These can be wonderful to use as signal, because changing them generally has severe effect on the behavior of the library. However, finding and using such data requires a proficient developer or analyst, and thus the cost can be prohibitively high for many shops.
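As one concrete instance of such data: zlib ships a precomputed CRC-32 table, and the first entries of the standard reflected table (polynomial 0xEDB88320) are 0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA. A sketch that matches them little-endian:

rule crc32_table_sketch
{
    strings:
        // first four entries of the standard reflected CRC-32 table
        // (polynomial 0xEDB88320), little-endian; corrupting these breaks
        // checksumming, unlike removing an informational string
        $crc = { 00 00 00 00 96 30 07 77 2C 61 0E EE BA 51 09 99 }
    condition:
        $crc
}

Two caveats: the identical table appears in every CRC-32 implementation that borrows it, so this signals "CRC-32 present" rather than "zlib present", and zlib builds configured to compute the table at run time (e.g. with DYNAMIC_CRC_TABLE) will not contain it at all.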
I thought it would be fun to explore this process using our old friend zlib
https://github.com/madler/zlib
I spent many hours testing strings, code, and data, in several combinations, against hundreds of files across four architectures and several formats. My most recent attempt is probably my best one, though by no means perfect.
rule zlib
{
    strings:
        // distfix: zlib's fixed-mode distance decoding table (inffixed.h),
        // stored as 4-byte {op, bits, val} entries
        $distfix = {10 05 01 00 17 05 01 01 13 05 11 00 1b 05 01 10 11 05 05 00 19 05 01 04 15 05 41 00 1d 05 01 40 10 05 03 00 18 05 01 02 14 05 21 00 1c 05 01 20 12 05 09 00 1a 05 01 08 16 05 81 00 40 05 00 00 10 05 02 00 17 05 81 01 13 05 19 00 1b 05 01 18 11 05 07 00 19 05 01 06 15 05 61 00 1d 05 01 60 10 05 04 00 18 05 01 03 14 05 31 00 1c 05 01 30 12 05 0d 00 1a 05 01 0c 16 05 c1 00 40 05 00 00}
        // lenfix_prefix: the opening bytes of zlib's fixed-mode literal/length
        // decoding table (also inffixed.h)
        $lenfix_prefix = {60 07 00 00 00 08 50 00 00 08 10 00 14 08 73 00 12 07 1f 00 00 08 70 00 00 08 30 00 00 09 c0 00 10 07 0a 00 00 08 60 00 00 08 20 00 00 09 a0 00 00 08 00 00 00 08 80 00 00 08 40 00 00 09 e0 00 10 07 06 00 00 08 58 00 00 08 18 00 00 09 90 00 13 07 3b 00 00 08 78 00 00 08 38 00 00 09 d0 00 11 07 11 00 00 08 68 00 00 08 28 00 00 09 b0 00 00 08 08 00 00 08 88 00 00 08 48 00 00 09 f0 00 10 07 04 00 00 08 54 00 00 08 14 00 15 08 e3 00 13 07 2b 00 00 08 74 00 }
    condition:
        all of them
}
I wanted to document the steps that I took, but unfortunately doing this properly would require more time than I'm willing to put toward the project. I will give you the highlights:
- obtain exemplars using "Mark Adler" and "Jean-loup Gailly" (for fun, do a frequency analysis of these terms on your corpus; you may be surprised)
- segregate by architecture/format
- automated code comparison using YARA signatures derived from each function in each file, followed by statistical analysis
- source-code inspection for global/static variables of significance
- dozens of iterations of encoding/searching/testing/validating
For reference, here is an alternate methodology that is cheaper and generally quite effective:
- find function in compiled library that seems unique
- automatically generate a YARA signature, nopping out constants, relative offsets, addresses, immediates, and other likely-to-change values (see the sketch after this list)
- add to ever-growing list of code-based rules with that label
- literally never touch it again unless you find a false positive
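A minimal sketch of that generation step, with invented bytes (this shows the shape of the output, not bytes from a real library): start from the raw function bytes and wildcard anything the compiler or linker is likely to change.

// hypothetical raw bytes: 8B 44 24 04 3D 10 27 00 00 75 0C E8 32 14 00 00
rule example_wildcarded_fn
{
    strings:
        // same bytes with the cmp immediate, jump displacement, and call
        // offset wildcarded, since those vary across builds
        $fn = { 8B 44 24 04 3D ?? ?? ?? ?? 75 ?? E8 ?? ?? ?? ?? }
    condition:
        $fn
}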
As I went through this, I wrote down some of the questions I would like for you to ponder in your quest to get better at creating YARA rules for libraries:
- What are the skills that I or others need to possess to have confidence in the quality of my rules?
- Without reverse engineering or source-code analysis, can I say anything concrete or defensible about the presence or absence of a library in a program?
- How would I go about obtaining a sufficient number of known exemplars to test the precision/recall of my rules?
- How much cost am I willing to impose to generate or maintain a rule?
If you want to share your own experiences developing/maintaining library rules, hit me up on Twitter @notareverser