@eliminmax
Last active April 24, 2023 00:13

EvilTCC

Conceptual Overview

Compilers

Compilers are incredible tools that revolutionized computing when they were invented. They enable programmers to work much more efficiently. Instead of first coming up with a program plan, figuring out how to translate it into machine code, and entering that machine code onto punch cards, they can simply get the computer to do that part. This frees up programmers to work on far more substantial things, like arguing about which language is best, which text editor or IDE is best, tabs vs. spaces, which operating system is best, whether or not font ligatures are good for programming, and whether or not this joke has gone on for too long.

It also enables faster development of far more complex and robust programs.

One interesting fact about them is that it is common for them to be "self-hosting", meaning that they are written in the same language that they are designed to compile.

Self-hosting compilers

To compile a Zig compiler, you must first have a copy of the Zig compiler. To compile a Rust compiler, you must first have a copy of the Rust compiler.

To compile a C compiler, you must first have a copy of a C compiler.

Of course, the first version of such a compiler had to be written in a different language - Rust's original compiler was written in OCaml, Zig's in C++, and C++ itself started out as a preprocessor layered on top of C.

As for C, the grandfather of most modern languages, well, it's complicated.

Chistory

The late Dennis M. Ritchie created the C programming language. He also co-wrote The C Programming Language, a book about the language with a very generic title. Incidentally, that book introduced much of the world to the "Hello world" program.

In 1993, he wrote an article for the Association for Computing Machinery about the development of the C language. Following the precedent of... uncreative titles set by his book, it was called The Development of the C Language. Sidenote: the title of the HTML page is Chistory, which I find a bit goofy, and appreciate.

In it, he explained that his colleague at AT&T's Bell Labs, Ken Thompson, led a group that aimed to find alternatives to the Multics computer system that they'd been working on, as it became clear that the project's prospects were grim.

Thompson wanted to create a comfortable computing environment constructed according to his own design, using whatever means were available. His plans, it is evident in retrospect, incorporated many of the innovative aspects of Multics, including an explicit notion of a process as a locus of control, a tree-structured file system, a command interpreter as user-level program, simple representation of text files, and generalized access to devices. They excluded others, such as unified access to memory and to files.

Thompson was faced with a hardware environment cramped and spartan even for the time: the DEC PDP-7 on which he started in 1968 was a machine with 8K 18-bit words of memory and no software useful to him. While wanting to use a higher-level language, he wrote the original Unix system in PDP-7 assembler. At the start, he did not even program on the PDP-7 itself, but instead used a set of macros for the GEMAP assembler on a GE-635 machine. A postprocessor generated a paper tape readable by the PDP-7.

These tapes were carried from the GE machine to the PDP-7 for testing until a primitive Unix kernel, an editor, an assembler, a simple shell (command interpreter), and a few utilities (like the Unix rm, cat, cp commands) were completed. After this point, the operating system was self-supporting: programs could be written and tested without resort to paper tape, and development continued on the PDP-7 itself.

Not long after Unix first ran on the PDP-7, in 1969, Doug McIlroy created the new system's first higher-level language: an implementation of McClure's TMG. TMG is a language for writing compilers (more generally, TransMoGrifiers) in a top-down, recursive-descent style that combines context-free syntax notation with procedural elements. McIlroy and Bob Morris had used TMG to write the early PL/I compiler for Multics.

In the interest of not just pasting in the whole paper, the ~~short~~ slightly less long version of what happened next is as follows:

  1. Inspired by McIlroy's work in porting TMG, Thompson created a systems programming language for use on this embryonic form of Unix.
  • In the article, Ritchie described this language, known as B, as "BCPL squeezed into 8K bytes of memory and filtered through Thompson's brain"
  2. After writing a B compiler in TMG, Thompson wrote a new B compiler in B.
  • This process of creating a self-hosting compiler is known as bootstrapping.
    • Both self-hosting and bootstrapping have several other meanings in the context of computing
      • I'm adding somewhat tangentially-related bullet points because I find them interesting
        • I'm adding a bunch of extra bullet points like this one because I find this kind of self-referential meta-observational anti-humor amusing.
          • I think I may have just coined the phrase "self-referential meta-observational anti-humor"
            • I think this bit, if you can call it that, has probably gone on for too long.
  3. Ritchie created a cross-compiler for B, which "translated B to GE-635 machine instructions, not threaded code."
  • He notes that it generated code for a 36-bit mainframe while running "on an 18-bit machine with 4K words of user address space."
  4. After Unix began to show promise, the team developing it, led by Thompson, acquired a new minicomputer with a different enough internal structure that B "began to feel awkward, even silly..."
  5. Ritchie began work on a type system for B, which had previously only had the concept of a "cell" data type. He called the result NB, for ~~non-binary~~ ~~New Balance~~ "new B".
  6. Ritchie decided that it should not be called NB anymore, as it had diverged too much, and called it C.
  • In his 1993 paper that I got the info in this section from, he notes that he "[left] open the question whether the name represented a progression through the alphabet or through the letters in BCPL"

Skipping ahead a little over a decade: in 1983, Unix and C were both successes, and Thompson and Ritchie were jointly given the Turing Award "for their development of generic operating systems theory and specifically for the implementation of the UNIX operating system." The Turing Award is an annual award given out by the Association for Computing Machinery "to recognize contributions of a technical nature which are of lasting and major technical importance to the computing field." Note that this is the same organization that Ritchie would later write the referenced article for.

During the award ceremony, in 1984, Thompson gave an acceptance speech, with an accompanying paper, both titled Reflections on Trusting Trust.

Reflections on Trusting Trust

In his acceptance speech and the accompanying paper, Thompson decided to do something unusual. Instead of talking about Unix, he instead talked about the compiler that shipped with it.

He pointed out that it's possible that the compiler could be set up to quietly and deliberately "miscompile" certain specific pieces of code, using the example of the Unix login program. If the compiler were to find a specific piece of code within that program, it could add a backdoor, granting Thompson access to the system.

Then he moved on to the really nasty part. Because the compiler was self-hosting, it could be set up to miscompile itself, adding all of this back in if it was then removed from the source code. He argued that there was no way to conclusively rule out the possibility that he had managed to slip a backdoor into every Unix system in the world.

As a quick aside, this speech and paper did much to popularize the term "Trojan horse" for malicious software, though the term had seen earlier use in computer-security circles.

If he actually did this, it was quite possibly the ultimate supply-chain attack. The talk presented it as a hypothetical, but there are some people who have claimed he actually implemented this, though did not spread it, and sometimes will even claim that he confirmed that over email or on USENET. Some go as far as to claim that he did unleash it, though likely accidentally. In terms of concrete evidence, there's about as much for an actual, in-the-wild "evil" compiler that can be traced back to Thompson as there is for the existence of Bigfoot.

That said, I find it to be entirely believable that he implemented it, because if I could do it with almost zero experience working with C, I'm sure he could have done it - though I doubt it was ever released to the public.

Oh, by the way, I'd like to introduce this capstone project - an implementation of the "Trusting Trust" compiler backdoor.

My Implementation

Original rough plan

My original plan was to add the following process to TCC to create EvilTCC:

  1. Look for a "signature" within the source code TCC is compiling to determine if it's compiling itself. If not found, jump to step 5. Otherwise, continue.
  2. Look for some pre-defined place to use as an injection point.
  3. Somehow determine if the code implementing this process has been added to the injection point. If it was, jump to step 5. Otherwise, continue.
  4. Somehow inject the code implementing this functionality into the injection point in memory before parsing the source file.
  5. Look for a "signature" within the source code TCC is compiling to determine if it's compiling the GNU Nano text editor. If so, keep going. If not, stop here.
  6. Find the code that writes the "Welcome to nano" help message, and somehow replace it in memory with something else (to be determined) before parsing the source file.

Problems with this plan

There were a few major problems with this plan:

  • It's complicated
    • need to first determine if compiling TCC or something else, then find the injection point within the codebase, then determine if it's already injected to decide whether or not to inject.
  • Lots of "somehow"s, and no idea what the "how" was.
    • Replacing the in-memory representation of a file seemed like something that could be done, but not something I had any idea how to do
    • No idea how the injection code could inject itself
  • I don't know C, or how either the TCC or Nano codebases are structured
    • I can learn new things, and I am willing to take on the challenge, but am I able to?
  • While I thought implementing this would be challenging, fun, and educational, I also was reluctant to do so, as it has the potential to be used to do harm.
    • I mean, that's the point of it, which is why I called it "EvilTCC", rather than something like "Totally Innocent and Innocuous Build of TCC with Nothing of Note That's Off About It."
    • Even though I have no plans to release it, I felt guilty even making it - this is a tool that has no benign usage that I can think of.
    • The best-case scenario if actually released: it is only used for harmless pranks on C developers

Actually Implementing EvilTCC

Changing plans during actual implementation

When I went to attempt to implement step 1, it occurred to me that if the "signature" contained the injection point, then once the code was injected, it would no longer be seen as a match. This meant that I could greatly simplify the implementation plan by collapsing steps 1-4 into a single step and steps 5-6 into another, making it take only 2 steps. The new plan was as follows:

  1. Somehow look for the "signature", and if found, overwrite it in memory before parsing the file for compilation with the code implementing this
  2. Somehow look for the code that writes the "Welcome to nano" help message, and somehow replace it in memory with something else (to be determined) before parsing the source file

With that, I had a workable, broad 2-step plan. The problem is that those 2 steps were each just massive "somehow"s, and I had no clue how I could do the first.

Breaking Down the "Somehow"s

In order to address this problem, I had to break step 1 down into smaller sub-steps, and tackle those one-by-one. I figured that once I'd done that, step 2 would be relatively trivial.

A list of sub-steps I came up with for step 1:

  1. Step 1.1: Determine Signature to Use
  • Determine a signature/injection point within the TCC source code
    • Ideally, this would be a code snippet that is involved in setting up files for parsing/compiling, or in validating source code before compiling
  2. Step 1.2: Signature Detection Function
  • Create a function that can detect the signature, and a program that prints a list of the files for which it returned a truthy value
  3. Step 1.3: Detect if Compiling Unmodified TCC
  • Manually modify the code at the signature/injection point to call the function created in step 1.2, and print a message whenever a match is found
    • a consequence of this is that the signature will no longer be matched, but that's fine
    • to test that this worked, build a modified tcc binary, then use that to build the unmodified source.
      • This is going to become a common thing
  4. Step 1.4: Replace File Content in Memory
  • Create a program that outputs the content of a file, but with the signature in the output replaced with something else
  5. Step 1.5: Make Self-Perpetuating Backdoor
  • Manually modify the code at the signature/injection point again, this time using the technique from step 1.4 to perpetuate the change
    • This turned out to be its own rabbit-hole, and required splitting it into "sub-sub-problems", if you will. More on that later.
    • This was also difficult to test in practice
      • the point was for the only change to the binary to be one that silently perpetuated itself, so the output was supposed to look identical to that of the unmodified version

Step 1.1: Determine Signature to Use

I decided on using check_func_return from tccgen.c, a function used to validate that functions return valid values. The function is as follows:

static void check_func_return(void)
{
    if ((func_vt.t & VT_BTYPE) == VT_VOID)
        return;
    if (!strcmp (funcname, "main")
        && (func_vt.t & VT_BTYPE) == VT_INT) {
        /* main returns 0 by default */
        vpushi(0);
        gen_assign_cast(&func_vt);
        gfunc_return(&func_vt);
    } else {
        tcc_warning("function might return no value: '%s'", funcname);
    }
}

This was a perfect choice, because it is obviously called in the ideal point in the compilation process. In fact, it's so obvious, I don't need to check if it's true, it's just so obvious.

On an unrelated note, here's a link to the Wikipedia page about foreshadowing.

I copied the function definition into a file simply called signature, and then ran `xxd -i signature > tcc_signature.h` to create a header file that defined the signature and could be included.

I then created a simple program that could take a list of filenames and print the names of those containing that function, and another simple program that simply included tcc_signature.h and printed the signature.

Things were going smoothly.

Step 1.2: Signature Detection Function

This was fairly straightforward - the only challenge was my nearly-nonexistent C experience. Before this point, I had only worked with 4 C programs:

  • the classic Hello world
  • a C program that makes 2 direct syscalls, to prototype the functionality of my tiny-clear-elf project
  • a C program implementation of colortest, a hobby project in which I rewrite a fairly simple program in as many different languages as possible
    • the program in question generates a specific "color test" pattern using ANSI color-control escape sequences.
      • If I have enough of an understanding of a language to both print arbitrary text, and create nested for-loops (C or Python style), I can implement colortest.
  • a program that called the crypt(3) function, used to create password hashes, with a specific algorithm and salt, to verify that I had properly created a hash for a password consisting of a single lowercase e - it's a long story.

I had also made a patch for strace, though it was a trivial change, so I don't think it counts.

I'd worked with C++ for a single class in my first semester of college, but I don't really remember much, and when I program, it's usually simple Python or shell scripts that automate the use of existing tools.

As a result, the challenge here was learning enough about C to get it working, and after that, I had very little trouble implementing it.

The only trouble was that the program I wrote to exercise the proof-of-concept function would segfault if the function was called on a non-existent filename. I didn't think much of it, as by the time my injection point was called, the file was already known to exist. I put it out of my mind, as there was obviously no way that would cause me any problems in the next step.

On an unrelated note, here's a link to Dictionary.com's page for the word foreshadowing.

Step 1.3: Detect if Compiling Unmodified TCC

This step was significantly more challenging than the previous one. While I had, in a fairly short amount of time, managed to build up enough understanding to complete the previous step, I had done so in no small part by avoiding many of C's rough edges, and doing things the fast way.

Thinking about it more at this point, I realized that my signature-detection process would need to be called before parsing begins in order to enable the signature to be replaced by the injected code. Unfortunately, I realized that the signature chosen in step 1.1 was called well into the process of parsing the file, so it was not going to work there. I had to find a new signature to use.

I went looking through TCC's source code again, and in the same file, I zeroed in on the function gen_inline_functions. Sure, I did not have enough of an understanding of TCC's internal structure to fully understand what it did and how it did it, but it was clearly re-writing the structure of the code in memory, as part of preparing the compilation process. That was obvious, and I figured that I had found the ideal injection point. It was obviously the ideal choice. Yes, so obvious.

Of course it did not work. In fact, it could not work - it was invoked far too late in the compilation process to be workable.

So for the third time, I went looking for an injection point.

To make a long story ~~short~~ slightly less long, I started fresh, and went back through the previous two steps, this time targeting tcc_open_bf. This function is used to prepare the buffers TCC loads code into before compiling it. It takes the filename as an argument, meaning it's much simpler to use, as I can simply pass the filename along as an argument to the signature detection function.

I implemented it without much trouble, and had a modified version of TCC that would print a message if a file contained the signature within it before parsing it. It's perfect! It's obviously the end of these woes, and there's no way I'll need to retool things yet again. Obviously.

On an unrelated note, here's a link to the TV Tropes page about foreshadowing.

I set up the tcc_open_bf function to call the sigdetect function on the filename it was given, and print a message if it found a match.

I thought it would work, and it did in some cases, but in other cases it would segfault.

I spent quite a bit of time trying to figure out why, until I added a line to tcc_open_bf to print out the filename. As it turns out, it's called in contexts other than preparing to open files, and in those contexts, the "filename" parameter is invalid. To address this, I added a check to the signature detection function to ensure that the file exists.

Step 1.4: Replace File Content in Memory

At this point, I am going to go significantly lighter on the details of my implementation, as I do not want to be responsible for anyone else using this kind of thing for malicious purposes.

Without getting into too much detail, I figured out a way to change the content that a file descriptor points to, without changing the original file at all, and used that to overwrite the signature with the substitution.

Step 1.5: Make Self-Perpetuating Backdoor

While in the early stages of this step, I realized that tcc_open_bf was not a great injection point after all: it is called before the file is even opened, so the file's content could not be overwritten in memory, as it was not yet in memory. This time, I found the injection point that actually worked: the tcc_compile function.

At the same time, I realized that for the code to be able to inject itself, it would need to be able to dynamically regenerate its own code. It could not simply contain itself. I would need to make the injection process a quine.

A quine, for those who don't know, is a program that outputs its own source code, without ever reading the source file.

I had heard of quines before, but had never implemented one. I still haven't - I found one online, and modified it to suit my needs.

Step 2: Miscompiling Nano

For the second step, "miscompiling" Nano, I simply used the same technique I figured out in step 1.4 to replace "statusbar(_("Welcome to nano. For basic help, type Ctrl+G."));" with "statusbar(_("Your nano has been hacked by an evil compiler."));" before compiling.
