pyokagan/README.rst

## README.rst

      
    Raw
  

              README.rst
            
          
    Make git-pull and git-am builtins


Abstract
git-pull and git-am are frequently used git subcommands. However, they are porcelain commands and implemented as shell scripts, which has some limitations which can cause poor performance, especially in non-POSIX environments like Windows. I propose to rewrite these scripts into low level C code and make them builtins. This will increase git's portability, and may improve the efficiency and performance of these commands.


Limitations of shell scripts

git-pull is a commonly executed command to check for new changes in the upstream repository and, if there are, fetch and integrate them into the current branch. git-am is another commonly executed command for applying a series of patches from a mailbox to the current branch. They are both git porcelain commands -- they have no access to git's low level internal API. Currently, they are implemented by the shell scripts git-pull.sh and git-am.sh respectively. These shell scripts require a fully-functioning POSIX shell and utilities. As a result, these commands are difficult to port to non-POSIX environments like Windows.
Since porcelain commands do not have access to git's internal API, performing any git-related function, no matter how trivial, requires git to be spawned in a separate process. This limitation leads to these git commands being relatively inefficient, and can cause long run times on certain platforms that do not have copy-on-write fork() semantics.
Spawning processes can be slow

Shell scripting, by itself, is severely limited in what it can do. Performing most operations in shell scripts require external executables to be called. For example, git-pull.sh spawns the git executable not only to perform complex git operations like git-fetch and git-merge, but it also spawns the git executable for trivial tasks such as retrieving configuration values with git-config and even quoting of command-line arguments with git rev-parse --sq-quote. As a result, these shell scripts usually end up spawning a lot of processes.
Process spawning is usually implemented as a fork() followed by an exec() by shells. This can be slow on systems that do not support copy-on-write semantics for fork(), and thus needs to duplicate the memory of the parent process for every fork() call -- an expensive process.
Furthermore, starting up processes on Windows is generally expensive as it performs several extra steps such as such as using an inter-process call to notify the Windows Client/Server Runtime Subsystem(CSRSS) about the process creation and checking for App Compatibility requirements.
The official Windows port of git, Git for Windows, uses MSYS2¹ to emulate fork(). Since Windows does not support forking semantics natively, MSYS2 can only emulate fork() without copy-on-write semantics. Coupled with Windows heavy process creation, this causes huge slowdowns of git on Windows.
A "no-updates" git-pull, for example, takes an average of 5.1s², as compared to Linux which only takes an average of 0.08s. 5 seconds, while seemingly short, would seem like an eternity to a user who just wants to quickly fetch and merge changes from upstream.
git-am's implementation reads each patch from the mailbox in a while loop, spawning many processes for each patch. Considering the cost of spawning each process, as well as the fact that runtime grows linearly with the number of patches, git-am takes a long time to process a seemingly small number of patches on Windows as compared to Linux. A quick benchmarks shows that git-am takes 7m 20.39s to apply 100 patches on Windows, compared to Linux, which took only 0.08s.
Commands which call git-am are also affected as well. git-rebase--am.sh, which implements the default, patch-based, non-interactive rebase of the git-rebase command, is implemented by first calling git-format-patch (a C builtin) to generate the patches of commits to be rebased, and then applying the commits on top of the new head using git-am. On Windows, the execution time of git-am currently takes up around 98% of git-rebase's execution time for non-interactive rebases³. If git-am was made to perform faster, git-rebase would benefit greatly.
Scripting languages introduce more dependencies

Scripting languages, while allowing for faster prototyping of commands, introduce dependencies into the git distribution. The problem is greater with shell scripts which, due to their simple nature, need to call other executables to perform non-trivial tasks. git-pull and git-am combined requires test, printf, echo, sed, tr, grep, cat, mkdir, rm, mv, perl etc. These may seem like common POSIX utilities, but when porting git to non-POSIX platforms, these commands must all be ported and installed on the system in its entirety. Git for Windows, for example, requires around 250MB of disk space for its installation as it includes the MSYS2 shell and utilities, as well as language runtimes for Perl and Tcl/Tk.
Furthermore, should any workarounds be required to support the platform, it must be implemented in all of these dependency libraries and executables. Git, for example, implements a compatibility function for unlink() on Windows that deals with failures that can occur if another process has an open handle on the file⁴. However, this only helps git builtins, and will have to be implemented again on all the other dependency shell utilities. Thus, shell scripts do not benefit from this compatibility function, and will have to workaround file deletion failures separately, leading to lots of duplicated code.
Scripts have inefficiencies

As stated above, porcelain do not have access to the git internal API, and thus have to spawn git processes in order to perform any git-related operation, no matter how trivial it is. In every git invocation, git would have to re-read the user's configuration files, repository index, repopulate the filesystem cache⁵ etc. This leads to a lot of unnecessary IO, which on systems with poor IO performance⁶ will lead to runtime overhead as well.
Git has internal caches for configuration values, the repository index and repository objects. By rewriting the porcelain into C builtins and using the git internal API instead of spawning separate git processes, these caches can be taken advantage of to reduce redundant IO accesses.
Rewriting shell scripts to C

In order to eliminate the aforementioned portability and performance limitations, I propose that these scripts be rewritten into C builtins. The C programming language provides a richer set of data types, and also provides access to git's internal API. Thus, the above problems can be solved:

git is a native executable on Windows, and thus does not require MSYS2 POSIX emulation. Spawning processes will not incur the overhead of MSYS2 fork emulation.
Converting shell scripts into builtins will reduce the git distribution's dependence on other software. In the future, if all core git functionality is converted into C, it would greatly ease porting of git to platforms which only require the core subset of git's functionality.
Taking advantage of git's internal API and C data types can remove the inefficiencies inherent in shell scripts. For example, instead of spawning a git-config process to get a configuration value, git_config_get_value() can be used. This function also takes advantage of git's configuration cache, which means that configuration files only need to be read once to access all of the configuration values, thus reducing IO.

In addition, git-pull and git-am's behavior can also become more consistent with other git builtin through the git internal API. Currently, command line arguments in git-pull and git-am are parsed manually with a while loop and pattern matching. As a result, their command line interface is inconsistent with other git builtins, which use the parse-options API. For example, in git-pull the option --st should thrown an ambiguous option error as it could mean --stat or --strategy, but git-pull.sh is hardcoded to assume --st resolves to --strategy. git-pull and git-am also do not provide usage information with the -h flag. Rewriting these commands to use parse-options would make their command line interfaces consistent with other git builtins.
In order to fully take advantage of the above benefits of implementing git-pull and git-am in C, especially with regards to making the resulting C code as efficient and fast as possible, I propose to follow these guidelines when rewriting the code:

Avoid⁷ spawning external git processes. This is to support systems with high fork() or process creation overhead, and to reduce redundant IO by taking advantage of the internal object, index and configuration cache.
Avoid needless parsing/string conversions by directly accessing the C data structures. Massaging of strings cause too much code churn, so functions that take the richer C data types as input should be preferred over functions that work by parsing string arguments.

Overall, rewriting git-pull and git-am into efficient C code will most likely lead to impressive speedups on platforms most affected by the aforementioned limitations. The prototype rewrite of git-pull, for example, takes only an average of 0.63s on the same "no-updates" benchmark. Compared to the shell script which took an average of 5.1s, this is nearly an 8x improvement in execution time.
On the other hand, the rewritten code should not cause regressions in the command as compared to the original scripts as the time wasted on dealing with the bugs may outweigh any advantages of rewriting these scripts, especially on platforms which do not suffer so much from the limitations of scripts. Therefore, I propose the following additional requirements:

The resulting builtin should not have wildly different behavior or bugs compared to the shell script.
The rewritten code should not be much harder to maintain than the original script.

These requirements leads to some potential difficulties in rewriting git-pull and git-am.
Potential difficulties

Rewriting code may introduce bugs

A dangerous possibility of rewriting code is that bugs may be introduced into the rewritten version without detection. The test suite can help in catching some bugs, however there is still the possibility that the test suite may not be comprehensive enough. Generally, the ideal method of catching errors and bugs is through careful rewriting and review of the code. However, such reviews would require a lot of time and effort. In order to ensure that the rewrite can be completed during the GSoC period, it would be useful if the test suite could provide some sort of guarantee that the rewritten code most probably did not introduce any bugs. For this purpose, code coverage tools can be used to ensure that the test suite for the command tests all major code paths in the rewritten built-in.
We can also avoid introducing logical errors by following the logical structure of the original script in the rewritten code. This means largely doing a line-by-line, one-to-one translation of the shell script into C. For instance, this line in git-pull.sh calls git-merge-base to calculate the point in history in which $curr_branch, the current branch, forked from the upstream remote tracking branch, $remoteref:
oldremoteref=$(git merge-base --fork-point "$remoteref" $curr_branch 2>/dev/null)
Assuming remoteref and curr_branch are strings, the equivalent C code, using the run-command API, would look like:
struct child_process cmd = CHILD_PROCESS_INIT;
struct strbuf oldremoteref = STRBUF_INIT;
argv_array_pushl(&cmd.args, "merge-base", "--fork-point", remoteref, curr_branch, NULL);
cmd.out = -1;
cmd.no_stderr = 1;
cmd.git_cmd = 1;
if (!start_command(&cmd)) {
    strbuf_read(&oldremoteref, cmd.out, 0);
    close(cmd.out);
    finish_command(&cmd);
}
Of course, the downside of following this too strictly is that if there were any logical bugs in the original code, or if the original code is unclear, the rewritten code would inherit these problems too. For example, the code in git-pull.sh assumes in certain code paths that the repository and refspecs to fetch are the only arguments present in "$@". However, the "$@" automatic variable would also contain the command-line options to be passed to git-fetch as well⁸. The rewritten git-pull should, for the purpose of making code maintenance easy, fix these logical errors.
Rewritten code may become harder to understand

For the purpose of reducing git's dependencies, the rewritten C code should not depend on other libraries or executables other than what is already available to git builtins. However, C is a much more verbose language than shell script, and does not provide much advanced facilities for processing text. As a result, text processing functions in shell scripts can end up as huge multi-line functions in C. For instance, in git-pull.sh, by relying on sed and tr, all heads valid for merge from FETCH_HEAD can be retrieved with a single short-and-succinct pipeline:
merge_head=$(sed -e '/  not-for-merge   /d' \
    -e 's/  .*//' "$GIT_DIR"/FETCH_HEAD | \
    tr '\012' ' ')
A strict translation of the pipeline to C becomes something like:
struct strbuf merge_head = STRBUF_INIT;
FILE *fp = fopen(git_path("FETCH_HEAD"), "r");
for (;;) {
    char sha1_text[40];
    char text[128];
    int c, ret;

    ret = fscanf(fp, "%40c %127[^\n]", sha1_text, text);
    if (ret == EOF)
        break;
    else if (ret != 2)
        die(_("error parsing %s"), filename);
    for (;;) {
        c = fgetc(fp);
        if (c == '\n' || c == EOF)
            break;
    }
    if (strstr(text, "not-for-merge"))
        continue;
    strbuf_addstr(merge_head, sha1_text);
    strbuf_addstr(merge_head, " ");
}
fclose(fp);
We can see that the C version requires much more lines compared to the shell pipeline, which will contributes to the "bloating" of the C code. It may even seem to be more cryptic. This is a known trade-off of the C programming language, and not depending on external libraries and dependencies. The best solution is to improve the design of the rewritten code. For example, the above can be moved into an appropriately-named function. The function can then be documented with well-defined inputs, outputs, and behavior. This should help the resultant code to be much more understandable.
Recurring code patterns, once identified, can be codified and implemented in the internal git API itself. This not only helps to shrink the C code, but it can also benefit the rest of the git code base.
Development approach and deliverables

In the first stage (and throughout the course of the project), test suite code coverage for each command is reviewed using kcov⁹. At the minimum, the test suite should cover the major code paths and command-line switches of the script. These tests do not necessarily have to be perfect or comprehensive before the next stage can begin¹⁰, however the minimum required tests must be identified and written in order to give a good idea of what needs to be tested, and provides a safety net for the later stages. Tests will be kept in a separate patch queue as they are applicable whether git-pull and git-am are builtins or not. By the end of the project, the test suite for git-pull and git-am will be more comprehensive.
In the second stage, the shell scripts are translated line-by-line into their corresponding C code. The translation should not be overly strict -- where appropriate the corresponding git internal API functionality can be used instead. For example, in git-pull and git-am the parse-options API can be used instead of manually implementing command-line argument parsing like how the shell scripts currently do. If the shell script has any obvious bugs or errors (perhaps identified in the first stage), they should be fixed in this stage as well. The resultant C code at the end of this stage may not necessarily be efficient, but it should be functionally complete and error free to serve as a baseline product for the final stage. The final product would most probably look something like the prototype rewrite of git-pull, but with more usage of the run-command APIs and without the bugs.
In the final stage, the baseline C code will be optimized for speed and efficiency. For prioritization of efforts, the baseline builtins will be profiled using their respective test suites in order to find hotspots. The builtins will also be benchmarked to compare their performance with the original shell scripts. Since git-pull is just a simple wrapper around git-fetch, git-merge and git-rebase, the benchmark will simply concentrate on run time for common use cases. (e.g. the "no updates" use case). For git-am, the benchmark can rely on its performance when applying a large number of patches (e.g. a patchbomb benchmark).
Should there be no hotspots found, general optimization will be done to reduce overhead. run-command API calls will be replaced with the relevant corresponding git internal API calls. Strings will be replaced with their corresponding git C data types where possible to avoid needless string conversions. For example, SHA1s will be parsed once and stored in unsigned char[20] or sha1_array data types. Some refactoring may also be required. For example, for git-pull to access git-merge-base --fork-point functionality without spawning a separate process, the function handle_fork_point() in builtin/merge-base.c must be made available. The function must also be modified to not call die() on errors. In order to help speed up polishing and code review, this optimization will be done iteratively in small patches on top of the baseline C code produced in stage 2. The final product of this stage, and the entire project, would be a patch series containing the baseline C code as the first patch, and optimizations to it in subsequent patches.
Timeline

Unfortunately, my university term begins on the 3rd of August. On the other hand, though, I am currently not attending university so I'm effectively free before the Summer holidays even begin. As such, to compensate, I will begin coding right after accepted student proposals are announced so that the period for coding is still 3 months.
The timeline is as follows:

Apr 27: Accepted student proposals announced.
Apr 27: Stage 1 begins: Review of test suite coverage.
Apr 27 onwards: Discussion of any bugs or behavior in git-am and git-pull that should be fixed or changed. This will continue throughout the project.
May 4: Stage 2 (Baseline rewrite) begins.
May 4 - May 10 (1 week): Full baseline rewrite of git-pull.sh.
May 11: 1st version of builtin/pull.c published on mailing list for review.
May 11 - May 24 (2 weeks): Full baseline rewrite of git-am.sh.
May 25: 1st version of builtin/am.c published on mailing list for review.
May 25 - Jun 22 (5 weeks): builtin/am.c and builtin/pull.c are iterated on in an alternating fashion to pass code review.
Jun 22: Stage 3 (Optimization) begins
Jun 22 - Jun 24: Benchmarking and profiling is done on the new builtins, especially as compared to their original shell scripts, to check for any performance improvements and to guide optimization. These results are published to mailing list and discussed.
Jun 24 - Jul 19 (3.5 weeks): Stage 3 (Optimization). builtin/am.c and builtin/pull.c are iterated on in an alternating fashion to optimize the code in small steps.
Jul 3: Mid-term evaluations deadline
Jul 20 - Aug 2 (2 weeks): Pencils down -- focus on polishing up builtin/am.c, builtin/pull.c and the tests patch series, and get them to pass code review.
Aug 17: Suggested 'pencils-down' date.
Aug 21: Firm 'pencils-down' date.
Aug 28: Final evaluation deadline

About me

I've been accepted to the National University of Singapore for a bachelors in Computer Engineering, and will matriculate on 3rd August 2015. I've been writing Python, C and C++ for around 4 years, and some of my experiments can be found on my github profile. I have also contributed patches successfully to libuv and node-contextify.
My microproject is to implement XDG_CONFIG_HOME support in git-credential-store. Patch series: v5, v4, v3, v2, v1.


The current Git for Windows is based on MSYS, but shares many of the limitations of MSYS2 as they are both derived from cygwin. This proposal is written with MSYS2 in mind, as Git For Windows will use it in the future.↩
The git-pull from a local remote with no updates. As a benchmark, git-pull was run with this setup 100 times on Windows and Linux boxes with similar specs. git-pull was executed from a native process on Windows to avoid MSYS2's forking overhead.↩
git-rebase was used to rebase 100 commits in non-interactive mode with GIT_TRACE on. The timings were taken from the resulting output and matched up with the code in git-am.sh and git-rebase--am.sh. The total execution time was 8m 0s, of which git-format-patch took only 0.5s, but git-am took 7m 52s.↩
See mingw_unlink() in compat/mingw.c.↩
fscache, currently only implemented in Git for Windows. See commit 99943de.↩
Poor IO performance can stem from the choice of operating system, filesystem as well as the underlying storage medium. Windows, for example, is known to have worse IO performance than Linux, which git was historically primarily written for. Network file systems such as NFS can also cause poor IO performance due to network latency. IO performance of HDDs will be lower than SSDs due to slower random access and higher latency.↩
This requirement needs to be balanced with the fact that significant refactoring of the git codebase may be required to access certain functionality without spawning a separate process. (e.g. ensuring errors are propogated instead of die()-ing).↩
An example of such an assumption can be found in remoteref="$(get_remote_merge_branch "$@" 2>/dev/null)". get_remote_merge_branch assumes that the first argument is the remote name, and will thus fail (silently) if it ends up being arguments for git-fetch.↩
An example kcov code coverage report for the git-pull tests can be found here↩
In fact, improvements to the test suite will most likely proceed in parallel with the rest of the project as the process of studying and rewriting the shell scripts will provide insights on what needs to be tested.↩