Som1Lse/build_system_bug.md

A tale of a build system bug

Prologue

So, for a long time, I've had issues with GCC on Windows. It worked fine on the first build, but whenever I made changes and wanted to build again, I would get the following error message:
ninja: error: FindFirstFileExA(c/:/path/to/project): The filename, directory name, or volume label syntax is incorrect.

and had to delete the contents of the build directory and build it again from scratch.
Obviously, that was too much of a hassle, so I just used MSVC and ignored it. Recently, I decided to look into it again, and hopefully fix it.
Before we start, a quick note about my toolchain: I am a stubborn kind of guy, so I insist on building GCC and related tools myself. Partly because I want to be able to use the most recent version, and partly because it is fun. It is less fun, though, when you have to track down these sorts of issues.
I use CMake and Ninja as my primary build system. Just as with the compiler, I build them myself from source.
Act I: A Ninja on the wrong path

We can tell, from the error message, that Ninja tried to call FindFirstFileExA with a malformed path. Apparently, the drive letter, C:, was split into two directories C and :, which, obviously, didn't work, and Windows rightly complained. For some reason, this only happened when using GCC. Google yielded nothing, so I had to investigate myself.
The first step is always to isolate where the issue happens, so I set out to figure out exactly which files needed to be deleted from the build directory for a build to work. I simply started deleting files until I found the culprit, which turned out to be a file called .ninja_deps. It is a binary file, but opening it in a hex editor revealed that it contained malformed paths. When I deleted the file, Ninja obviously couldn't read it, and hence never saw the malformed paths that triggered the error.
Unfortunately, deleting the file causes Ninja to do a full rebuild. It is helpful to know why:
Build systems, like Ninja, try to only recompile source files that have actually changed between builds. This includes the case where a header included by a source file (or transitively by a header file included by the source file) has changed. Because of this, the build system needs to know the dependencies of each file in the project. In Make, these are the file names after a colon, so
main.o: main.cpp foo.h bar.h
means the object file main.o depends on the source file, main.cpp, and two header files, foo.h and bar.h. Whenever any of these files are changed, the object file needs to be recompiled.
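The rebuild decision described above can be sketched as a timestamp comparison. This is a minimal illustration, assuming a purely modification-time-based check; the function name `needs_rebuild` is hypothetical and is not taken from Make or Ninja.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency.

    Hypothetical helper: a simplified, mtime-only model of the decision
    a build tool makes for one rule such as "main.o: main.cpp foo.h bar.h".
    """
    if not os.path.exists(target):
        # Nothing has been built yet, so the target must be produced.
        return True
    target_mtime = os.path.getmtime(target)
    # Any dependency modified after the target makes the target stale.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

Calling `needs_rebuild("main.o", ["main.cpp", "foo.h", "bar.h"])` would then answer the question the rule above encodes.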
Unlike Make, Ninja does not have these dependencies hardcoded into the build.ninja file. It also doesn't understand C++ and hence isn't able to parse the source files to find out. Luckily the compiler knows how and provides a way for Ninja to ask it. For GCC this is the -MD switch, which outputs Make rules like the following:
main.o: c:/path/to/project/main.cpp c:/path/to/project/foo.h \
    c:/path/to/project/bar.h
The paths are absolute because, in my case, the build directory is on a different drive, but even if I had used a build subfolder, the system header files would have an absolute path anyway. A backslash at the end of a line splits the rule across multiple lines, so it stays readable.
This is how Ninja knows the dependency graph of your project. Whenever it builds a source file for the first time, it asks the compiler which header files it included (this can be done during the compilation). The compiler writes this to a file, which Ninja then parses and stores in the .ninja_deps file. Whenever you ask it to build the project again, it uses the information stored there to rebuild only the files that have been changed.
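To make the format concrete, here is a rough sketch of reading one such rule. It is deliberately naive: `parse_make_rule` is a hypothetical name, it handles a single rule only, and it ignores escaped characters entirely, so it is not how Ninja's real parser works.

```python
def parse_make_rule(text):
    """Split a single Make rule into (target, [dependencies]).

    Naive sketch: one rule only, no handling of escaped characters.
    """
    # A backslash followed by a newline is a line continuation; treat it
    # as plain whitespace so the rule becomes one logical line.
    joined = text.replace("\\\n", " ")
    # Split at the first colon that is followed by a space, so the colon
    # in a drive letter such as "c:/" is left alone.
    target, _, deps = joined.partition(": ")
    return target.strip(), deps.split()


rule = (
    "main.o: c:/path/to/project/main.cpp c:/path/to/project/foo.h \\\n"
    "    c:/path/to/project/bar.h\n"
)
target, deps = parse_make_rule(rule)
print(target)  # main.o
print(deps)
```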
A nice feature of Ninja is, you can pass -nv, to tell it to print the commands it would run (-v), but not actually run them (-n). We can then copy the command and run it manually, and inspect the output. Doing this, we can look at the Make rules generated by GCC:
main.o: c\:/path/to/project/main.cpp c\:/path/to/project/foo.h \
    c\:/path/to/project/bar.h
So the issue lies somewhere in GCC. Presumably, because GCC isn't exactly designed for Windows, and Windows paths are quite different from POSIX paths (which GCC is designed for), somewhere along the way, the path gets mangled. Now I just have to find the code that does this, isolate it, produce a good test case and submit a bug report to be fixed.
Act II: An unfamiliar source

Since I compile GCC myself, I have the exact source code used to build it. Now, GCC's source code is kind of hard to read, if you are not familiar with it, which I am not, but I eventually tracked down the correct file libcpp/mkdeps.c. (This is actually a C++ file. Transitioning projects from C to C++ results in funny things like this.)
Whenever a new header file is included, a function named deps_add_dep is called, which adds the path to a vector. At the end of preprocessing, the deps_write function is called, which formats the output and prints it.
Before the path is added to the vector, a function named apply_vpath is called, which seems like a good candidate for our problem function. It has to do with the VPATH environment variable, which is a funny GNU Make feature: essentially, it specifies a list of directories to look for files in. apply_vpath applies this process in reverse, so if the header path starts with a path in the vpath vector, that bit is removed before the path is added to the vector. It also removes a leading ./.
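As a loose sketch of that behaviour, based only on the description above and not on GCC's actual code, the prefix stripping could look like this. The name `strip_vpath` and the exact matching rules are assumptions made for illustration.

```python
def strip_vpath(path, vpath_dirs):
    """Drop a leading vpath directory and any leading "./" from path.

    Hypothetical simplification of what apply_vpath is described as
    doing; the real function has more special cases.
    """
    for directory in vpath_dirs:
        # Only match whole directory components, not partial names.
        prefix = directory.rstrip("/") + "/"
        if path.startswith(prefix):
            path = path[len(prefix):]
            break
    # Remove leading "./" regardless of whether a vpath entry matched.
    while path.startswith("./"):
        path = path[2:]
    return path
```

With an empty list of vpath directories, which is the situation in this story, a path such as `c:/path/to/project/foo.h` passes through unchanged.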
Funny thing is, I couldn't find out where the vpath vector was filled in my manual static analysis of the code, so at this point, I decided to run it under a debugger: I ran GCC under trusty ol' x64dbg, set a breakpoint in apply_vpath and... nothing. It didn't even trip.
At this point, I remembered that GCC is divided into separate front and back ends. When you run g++, that is just the front end, but the actual work happens in the back end cc1plus, which is a separate program, and the debugger was only attached to the front end. A quick google search led me to a plugin for x64dbg, which automatically attached a new debugger to every child process. The installation was quick and rather painless. I set the breakpoint in the child process and... it tripped.
The input path started with C:/, it had not been garbled yet. I stepped through apply_vpath, one instruction at a time. I reached the end, and it had done bugger all to the path. Still as unmalformed as ever. apply_vpath was not to blame, not too surprising since the vpath vector was never filled. I instead turned my attention to deps_write.
deps_write calls a function named make_write (since it is printing the paths in a Makefile-format). It calls a function named make_write_vec, which calls a function named make_write_name for each header path. This function calls yet another function named munge on the path.
Turns out munge is a very apt name for the function. It munges the path, by looping over the string, and when it encounters certain characters, it escapes them. Characters like space, obviously, since otherwise paths with spaces would not work, but also #, the comment indicator in makefiles, backslashes, and, yes, as of a change included in GCC 10, colons. We have found the cause of our problem.
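The effect on a Windows path can be reproduced with a small sketch. This is a simplified stand-in and not GCC's munge: the name `escape_for_make` is hypothetical, and it only covers spaces, `#` and colons, leaving out the backslash handling entirely.

```python
# Characters this sketch escapes; an assumption covering only part of
# what the real function is described as handling.
ESCAPED = {" ", "#", ":"}


def escape_for_make(path):
    """Prefix each special character in path with a backslash."""
    return "".join("\\" + ch if ch in ESCAPED else ch for ch in path)


print(escape_for_make("c:/path/to/project/main.cpp"))
# prints: c\:/path/to/project/main.cpp
```

The printed result matches the shape of the paths in the Make rules GCC generated earlier.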
Act III: The solution

Now that we have found the cause, we need to determine how best to go about fixing it. Obviously, GCC should fix their stupid code, and be cross-platform instead of stuck in their own little world, so munge needs to #ifndef _WIN32 the colon handling, since colons are a part of paths there. Right?
Well, no. Presumably (Chesterton's fence) there is a reason colons are escaped. Hence, build tools like Ninja would still need to handle escaped colons on other platforms, where colons can still appear in paths as just a regular character in a filename. If they have to deal with this everywhere, that just makes the code more uniform. This is a general rule of thumb when it comes to cross-platform software: The less special-casing, the better. So, the onus is on Ninja to fix their dependency parser.
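What handling escaped colons means on the reading side can be sketched as follows. This is an illustration under assumptions, not Ninja's actual parser: `unescape_make_path` is a hypothetical name, and the set of recognised escapes (space, `#`, colon) is chosen only to mirror the characters discussed here.

```python
def unescape_make_path(token):
    """Undo backslash escapes for space, '#' and ':' in a depfile token.

    Other backslashes are kept as they are, so Windows-style separators
    survive untouched.
    """
    out = []
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\" and i + 1 < len(token) and token[i + 1] in " #:":
            # Escaped special character: keep it, drop the backslash.
            out.append(token[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(unescape_make_path("c\\:/path/to/project/foo.h"))
# prints: c:/path/to/project/foo.h
```

The same function behaves identically on every platform, which is the point of avoiding special cases.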
At this point, I realised something: Ninja had normalised the paths, replacing backslashes (which are Windows' directory separators) with forward slashes (which everything else uses, and Windows also supports), which had confused me into thinking the drive letter had been split into two directories. Hence all my previous Google searches had been about this, not escaped colons. Searching instead for "gcc escapes colon" yielded a pull request for Ninja, which was merged in May, shortly after the release of GCC 10.
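The misleading transformation is easy to reproduce. The snippet below is only a model of the normalisation step described above, with a hypothetical function name, and not Ninja's implementation.

```python
def normalise_separators(path):
    """Replace every backslash with a forward slash."""
    return path.replace("\\", "/")


# The escaped colon written by GCC turns into what looks like two
# separate directories, "c" and ":".
print(normalise_separators("c\\:/path/to/project"))
# prints: c/:/path/to/project
```

The output is exactly the path from the original error message.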
Turns out I last built Ninja on the 16th of August, three months after the patch had been merged, but I build it from the release branch, not the master branch. It was only merged into the release branch on the 18th of August, with the 1.10.1 update. Two bloody days after I built it.
So I simply built Ninja again.
Epilogue

After such a trip, it is worth looking back and seeing how things could have been made easier. Ultimately, all the pain came from a mistaken assumption: that the drive letter had been split into two directories, while in reality a colon had been escaped. This could have been caught fairly early when I inspected the output of -MD, where the paths clearly contain a backslash before the colon (c\:/), not a forward slash (c/:/). If I had realised this earlier, I would have found the pull request much sooner, and wouldn't have had to go down a long, wrong rabbit hole.
At this point the author, that being me, should blame themselves and conclude that, next time, they should not jump to conclusions so quickly. Next time, they should check their assumptions at every step. That is how such a blog post is supposed to end, but ultimately, I don't think so in this case. It is worth remembering that we make our assumptions for a reason: They are often correct. It is easy to conclude "don't make as many assumptions" when confronted with a case like this, and not take into account all the cases where those same assumptions saved you a lot of time, and "don't make mistaken assumptions" is trivial, and completely useless.
Instead, I think it is important to remember that our assumptions can be wrong, and, when presented with evidence that they are, we should reconsider them, which is exactly what I did. I wrote this down, in part because I think the story is funny, and I hope you had a laugh (especially if you already knew what the problem was) at my expense, in part because it details how you might start to tackle solving a bug like this, which is rooted in large foreign codebases, but I especially wrote this because it has an important moral: Sometimes, no one is at fault (including you), and you just get unlucky.