notareverser/code-signatures.treatise.txt

## code-signatures.treatise.txt
Today for #100DaysOfYARA I want to dive into some of the dirty secrets of creating and maintaining code-based YARA signatures.

Let's use SQLite3 as an example. Go get the source here (I prefer the amalgamation):
https://sqlite.org/download.html

I would like to reliably detect when a file is using SQLite. I often look at Windows executables, so I'm going to concentrate first on x86 programs that use this library. The easiest way to find them is to start with cleartext strings. In this case, I'm gonna pop over to VirusTotal and search for an easily identifiable string:


content: "failed to allocate %u bytes of memory"  type:pe


This is gonna find files that use SQLite or libpng, at the very least, but we can sort that out as we go. I'll grab one to two hundred of these files, as that gives me a decent amount of signal to work with. You'll get a mix of 32-bit and 64-bit files, but you can split those out into manageable buckets. Get rid of any files that do not contain the string (VT may have returned results where the string is only found in a subfile, and that's probably not worth digging into at the moment).

Here are some YARA rules to use for this:

rule SQLite3_string
{
   // may false-positive to libpng
   strings: $v = "failed to allocate %u bytes of memory"
   condition: $v
}

rule NOT_SQLite3_string
{
   strings: $v = "failed to allocate %u bytes of memory"
   condition: not $v
}

You can modify these rules to do things like exclude hits in an overlay or resource, but that's your business. I'll generally leave them be so I can get a good sense for what kind of nonsense I'm dealing with
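
If you do want to go down that road, here's a rough sketch of what excluding the overlay and resources could look like. It assumes YARA's pe module is available; the rule name is made up for this example, and treating a section named .rsrc as "the resources" is a convention rather than a guarantee.

```yara
import "pe"

rule SQLite3_string_in_section
{
   strings:
      $v = "failed to allocate %u bytes of memory"
   condition:
      // keep only hits whose file offset lies inside some section's raw data,
      // skipping the section conventionally used for resources (assumption)
      for any i in (1..#v) : (
         for any j in (0..pe.number_of_sections - 1) : (
            pe.sections[j].name != ".rsrc" and
            @v[i] >= pe.sections[j].raw_data_offset and
            @v[i] <  pe.sections[j].raw_data_offset + pe.sections[j].raw_data_size
         )
      )
}
```

Anything appended after the last section (the overlay) falls outside every section's raw data, so hits there are ignored. This is untested against the sample set above, so treat it as a starting point.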


$ ls *.virustotal | wc -l
158

$ yara /tmp/sqlite3.yara . -i SQLite3_string | wc -l
143

$ for x in `ls *.virustotal | parallel 'yara /tmp/sqlite3.yara -i NOT_SQLite3_string {}' | awk '{print $2}'`
> do
> rm $x
> done

$ ls *.virustotal | wc -l
143
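
The 32/64-bit split mentioned earlier can be done with YARA as well. A minimal sketch, assuming the pe module is available (the rule names are just for illustration):

```yara
import "pe"

rule PE_32bit_x86
{
   condition:
      // machine field of the PE file header says i386
      pe.machine == pe.MACHINE_I386
}

rule PE_64bit_x64
{
   condition:
      // machine field of the PE file header says AMD64
      pe.machine == pe.MACHINE_AMD64
}
```

Run them with -i the same way as the rules above to list each bucket. Non-PE files match neither rule.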


Here's the code from sqlite3.c that we're looking for:

static void *sqlite3MemMalloc(int nByte){
#ifdef SQLITE_MALLOCSIZE
  void *p;
  testcase( ROUND8(nByte)==nByte );
  p = SQLITE_MALLOC( nByte );
  if( p==0 ){
    testcase( sqlite3GlobalConfig.xLog!=0 );
    sqlite3_log(SQLITE_NOMEM, "failed to allocate %u bytes of memory", nByte);
  }
  return p;
#else
  sqlite3_int64 *p;
  assert( nByte>0 );
  testcase( ROUND8(nByte)!=nByte );
  p = SQLITE_MALLOC( nByte+8 );
  if( p ){
    p[0] = nByte;
    p++;
  }else{
    testcase( sqlite3GlobalConfig.xLog!=0 );
    sqlite3_log(SQLITE_NOMEM, "failed to allocate %u bytes of memory", nByte);
  }
  return (void *)p;
#endif
}

You can read through the macros if you want, but this is basically doing a malloc and then logging an error if it fails.


Crack one of your files open in IDA, find the string "failed to allocate %u bytes of memory" and go to its cross-references. Bam, you should be at sqlite3MemMalloc, which is not a bad place to start. If the string isn't in a loaded section (e.g. it sits in an embedded file in the overlay or a resource), just pick another file until you find one that this works for; it's not that big of a deal. Here's an example from a file:

d6a82e9a0fa1747bc5097423dc17c0af

void *__cdecl sqlite3MemMalloc(size_t Size)
{
  void *result; // eax

  result = malloc(Size);
  if ( !result )
  {
    sqlite3_log(7, "failed to allocate %u bytes of memory", Size);
    return 0;
  }
  return result;
}

rule sqlite3MemMalloc
{
   strings:
      $chunk_00464230 = {55 8b ec 56 ff 75 08 ff 15 ?? ?? ?? ?? 8b f0 83 c4 04 85 f6 75 14 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c3}
   condition:
      any of them
}

$ yara /tmp/sqlite3.yara -i sqlite3MemMalloc . | wc -l
5

Well that's a pretty abysmal hit rate. Let's try another file and see what's what.

fdea8d5510f4b163b167d210dc58f5fe

Ooh, this one has symbols, nice

_DWORD *__cdecl sqlite3MemMalloc(int a1)
{
  _DWORD *v1; // eax
  _DWORD *result; // eax

  v1 = malloc(a1 + 8);
  if ( v1 )
  {
    *v1 = a1;
    result = v1 + 2;
    *(result - 1) = a1 >> 31;
  }
  else
  {
    sqlite3_log(7, "failed to allocate %u bytes of memory", a1);
    return 0;
  }
  return result;
}

rule _sqlite3MemMalloc
{
   strings:
      $chunk_6e682020 = {53 83 ec 28 8b 5c 24 30 8d 43 08 89 04 24 e8 ?? ?? ?? ?? 85 c0 74 19 89 18 c1 fb 1f 83 c0 08 89 58 fc 83 c4 28 5b c3 89 f6 8d bc 27 00 00 00 00 89 5c 24 08 c7 44 24 04 ?? ?? ?? ?? c7 04 24 07 00 00 00 89 44 24 1c e8 ?? ?? ?? ?? 8b 44 24 1c eb d0}
   condition:
      any of them
}

$ yara /tmp/sqlite3.yara . -i _sqlite3MemMalloc | wc -l
1

Ugh, even worse, only a single file. We now have two rules for the same function, so let's rename them to use incrementing suffixes (_1, _2, etc.).

Try again:

ecfc48ee83d87b149e8f5094ce5e2371

int __cdecl sqlite3MemMalloc(int a1)
{
  int v1; // esi

  v1 = SQLITE_MALLOC(a1);
  if ( !v1 )
    sqlite3_log(7, "failed to allocate %u bytes of memory", a1);
  return v1;
}

rule sqlite3MemMalloc_3
{
   strings:
      $chunk_100ee160 = {55 8b ec 56 ff 75 08 e8 ?? ?? ?? ?? 8b f0 59 85 f6 75 12 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c3}
   condition:
      any of them
}

$ yara /tmp/sqlite3.yara -i sqlite3MemMalloc_3 . | wc -l
29

Now we're getting somewhere! Let's pause here and compare our three rules:

rule sqlite3MemMalloc_1
{
   strings:
      $chunk_00464230 = {55 8b ec 56 ff 75 08 ff 15 ?? ?? ?? ?? 8b f0 83 c4 04 85 f6 75 14 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c3}
   condition:
      any of them
}

rule sqlite3MemMalloc_2
{
   strings:
      $chunk_6e682020 = {53 83 ec 28 8b 5c 24 30 8d 43 08 89 04 24 e8 ?? ?? ?? ?? 85 c0 74 19 89 18 c1 fb 1f 83 c0 08 89 58 fc 83 c4 28 5b c3 89 f6 8d bc 27 00 00 00 00 89 5c 24 08 c7 44 24 04 ?? ?? ?? ?? c7 04 24 07 00 00 00 89 44 24 1c e8 ?? ?? ?? ?? 8b 44 24 1c eb d0}
   condition:
      any of them
}

rule sqlite3MemMalloc_3
{
   strings:
      $chunk_100ee160 = {55 8b ec 56 ff 75 08 e8 ?? ?? ?? ?? 8b f0 59 85 f6 75 12 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c3}
   condition:
      any of them
}

If you read x86 assembly, you've already noticed in IDA that these functions have different stack layouts, different address bases, different call instructions (rel/abs), etc. If you don't read x86 assembly, you've noticed that they have different lengths and byte values. This sucks, as it isn't going to be easy to merge them into a single string. What I will often do in these situations is just combine them into a single rule:

rule SQLite3_code
{
  strings:
   $memMalloc_1 = {55 8b ec 56 ff 75 08 ff 15 ?? ?? ?? ?? 8b f0 83 c4 04 85 f6 75 14 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c3}
   $memMalloc_2 = {53 83 ec 28 8b 5c 24 30 8d 43 08 89 04 24 e8 ?? ?? ?? ?? 85 c0 74 19 89 18 c1 fb 1f 83 c0 08 89 58 fc 83 c4 28 5b c3 89 f6 8d bc 27 00 00 00 00 89 5c 24 08 c7 44 24 04 ?? ?? ?? ?? c7 04 24 07 00 00 00 89 44 24 1c e8 ?? ?? ?? ?? 8b 44 24 1c eb d0}
   $memMalloc_3 = {55 8b ec 56 ff 75 08 e8 ?? ?? ?? ?? 8b f0 59 85 f6 75 12 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c3}

  condition:

     any of ($memMalloc_*)

}

$ yara /tmp/sqlite3.yara -i SQLite3_code . | wc -l
35
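
To pick the next file to open, it helps to list the files that have the string but no code hit yet. A small sketch, assuming SQLite3_string and SQLite3_code are defined earlier in the same rule file (the new rule name is hypothetical):

```yara
rule SQLite3_string_without_code
{
   condition:
      // the cleartext string is present, but none of the code patterns matched
      SQLite3_string and not SQLite3_code
}
```

Any file this reports is a candidate for the next round in IDA.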


We can keep working through the files we have

3ccf4e81be940e1592ff99c7fe45d7fb

void *__stdcall sqlite3MemMalloc(size_t Size)
{
  void *v1; // esi

  v1 = malloc(Size);
  if ( !v1 )
    sqlite3_log(7, "failed to allocate %u bytes of memory", Size);
  return v1;
}


rule sqlite3MemMalloc
{
   strings:
      $chunk_005e3d9d = {55 8b ec 56 ff 75 08 e8 ?? ?? ?? ?? 8b f0 59 85 f6 75 12 ff 75 08 68 ?? ?? ?? ?? 6a 07 e8 ?? ?? ?? ?? 83 c4 0c 8b c6 5e 5d c2 04 00}
   condition:
      any of them
}

$ yara /tmp/sqlite3.yara -i sqlite3MemMalloc . | wc -l
4


This is really starting to suck! We're not picking up many files with each new iteration, and we started with a ridiculously small number of files, on the order of 100. Time to start thinking. WHY is this happening? The reasons most obvious to me:

- the original code is full of macros and other pre-processor directives, giving the compiler and the developer ample opportunity to screw with the compilation process
- along those lines, different optimization settings are going to produce different amounts of calls, inlining, and other modifications to the callgraph/opcodes
- the function isn't terribly large, so misses are going to be big misses


I don't want to continue down this road very much longer, as I'm running out of patience. What I'll generally do in this case is go back to the source (in the case of a library) or the files I'm inspecting (in the case of a family) and try to find a function that is either a bit longer, a bit less generic, or that has fewer other ways to screw me over. Make no mistake, sometimes this is not possible, and you may end up with a code-based rule that has 10+ clauses.
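
When several variants share the same skeleton, hex-string alternatives can keep the clause count down. Here's a sketch that folds $memMalloc_1, $memMalloc_3, and the __stdcall variant into one string. The rule and string names are made up, the instruction comments are a reading of the bytes rather than something taken from the original listings, and it is untested against the sample set:

```yara
rule SQLite3_memMalloc_merged
{
   strings:
      $memMalloc_ebp_frame = {
         55 8b ec                               // push ebp; mov ebp, esp
         56                                     // push esi
         ff 75 08                               // push [ebp+8] (nByte)
         ( ff 15 ?? ?? ?? ?? | e8 ?? ?? ?? ?? ) // call malloc: indirect or relative
         8b f0                                  // mov esi, eax
         ( 83 c4 04 | 59 )                      // stack cleanup: add esp, 4 or pop ecx
         85 f6                                  // test esi, esi
         75 ??                                  // jnz past the logging call
         ff 75 08                               // push [ebp+8] (nByte)
         68 ?? ?? ?? ??                         // push address of the format string
         6a 07                                  // push 7 (SQLITE_NOMEM)
         e8 ?? ?? ?? ??                         // call sqlite3_log
         83 c4 0c                               // add esp, 0Ch
         8b c6                                  // mov eax, esi
         5e 5d                                  // pop esi; pop ebp
         ( c3 | c2 04 00 )                      // ret, or ret 4 for __stdcall
      }
   condition:
      $memMalloc_ebp_frame
}
```

By construction this matches everything the three individual strings match, plus combinations that were not observed (for example an indirect call followed by pop ecx), so it is slightly looser. The second variant ($memMalloc_2) has a different shape entirely and still needs its own string.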

What's the point of all this? I want you to take away a few key points:
- code-based signatures are great when they work, but there are lots of ways they can fail
- strings-based signatures have completely different failure modes, including a complete absence of signal
- the effort involved in creating a "great" signature can be tremendously high, while the effort involved in "good enough" signatures is generally far lower
- your signatures are only ever going to be as good as the data you test them against