@modocache
Last active August 29, 2015 13:57
GSoC Proposal Draft: Refactor Temporary File Handling in Git

GSoC: Refactor Temporary File Handling in Git

Abstract

Git creates temporary files throughout the codebase. It creates these files to perform atomic operations[1] or to pass them to external tools[2]. Once used, Git deletes the files from disk--or, at least, it should. Unfortunately, the Git source code does not currently have a unified method of creating and deleting temporary files, and some code paths leave supposedly "temporary" files on disk.

This project aims to unify temporary file handling in Git, ensuring that all temporary files behave in the same way, and that all of them are deleted before the program exits. For forensic purposes, the new temporary file API will also provide users the option to specify that temporary files should remain on disk in the case that Git encounters a fatal error during execution.

Personal Details

  • Name: Brian Gesiak
  • Email: modocache@gmail.com
  • IRC nick: modocache
  • Telephone: +81-XXX-XXXX-XXXX (Japan)
  • Other contact methods: @modocache (Twitter), brian.gesiak (Skype)
  • Country of residence: Japan
  • Timezone: Japan Standard Time (UTC+09:00)
  • Primary language: English

Project Proposal

This project was originally featured on the ideas page.

I propose extracting the temporary file handling functions used in lockfile.lock_file and lockfile.remove_lock_file into a separate interface, tempfile.h.

The lockfile functions maintain a linked list of temporary files. The remove_lock_file function is registered as an atexit exit handler. It traverses the list of temporary files and ensures they are all deleted when the program exits.
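The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not Git's actual lockfile code: the names struct tempfile, create_tempfile, and remove_tempfiles are hypothetical, and the sketch assumes a POSIX system with mkstemp(3). Each struct tempfile is assumed to stay valid until the program exits, since the exit handler still walks the list at that point.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical type: one node per temporary file that is still on disk. */
struct tempfile {
	struct tempfile *next;
	char *path; /* NULL once the file has been removed */
};

static struct tempfile *tempfile_list;
static int cleanup_registered;

/* Walk the list and unlink every file that is still registered.
 * Safe to call more than once; it also runs automatically at exit. */
static void remove_tempfiles(void)
{
	struct tempfile *t;

	for (t = tempfile_list; t; t = t->next) {
		if (t->path) {
			unlink(t->path);
			free(t->path);
			t->path = NULL;
		}
	}
}

/* Create a file from a mkstemp(3) template ("...XXXXXX") and add it to
 * the list. Returns an open file descriptor, or -1 on failure.
 * The caller must keep *t alive until the program exits. */
static int create_tempfile(struct tempfile *t, const char *template)
{
	int fd;

	assert(t != NULL && template != NULL);

	t->path = strdup(template);
	if (!t->path)
		return -1;

	fd = mkstemp(t->path);
	if (fd < 0) {
		free(t->path);
		t->path = NULL;
		return -1;
	}

	/* Register the exit handler the first time a file is created. */
	if (!cleanup_registered) {
		atexit(remove_tempfiles);
		cleanup_registered = 1;
	}

	t->next = tempfile_list;
	tempfile_list = t;
	return fd;
}
```

A caller would declare a static struct tempfile, pass it to create_tempfile along with a template such as "/tmp/example-XXXXXX", and write to the returned descriptor. Any file still on the list when the program calls exit is unlinked by the handler. Note that an atexit handler does not run when the process is killed by a signal, so a real implementation would presumably need a signal handler as well.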

The project is composed of two main phases:

  1. Extract the lockfile functions into tempfile.h.
  2. Replace each custom implementation of temporary files in the Git codebase with tempfile.h functions.

Based on my investigation thus far, temporary files are ubiquitous in Git. There are, however, four main implementations. Below are the four functions which create temporary files (in order of frequency of use):

  1. lockfile.lock_file [3]
  2. git-compat-util.odb_mkstemp [4]
  3. git-compat-util.odb_pack_keep [5]
  4. diff.prepare_temp_file [6]

The project will involve replacing each of these implementations in turn. I will add tests as necessary to ensure each of the refactored implementations works as expected.

TODO: Examine the differences between each implementation, and perhaps draft an interface for tempfile.h that takes those into account.

Schedule of Deliverables

  • 4/22: [GSoC] Accepted student proposals announced.
  • 4/22 - 4/28 (one week): Discuss interface for tempfile.h with mentor.
  • 4/29 - 5/12 (two weeks): As a "proof of concept", experiment with replacing the implementation of lockfile.c with tempfile.h functions.
  • 5/13 - 5/18 (one week): Present a diff with the "proof of concept" and request feedback from mentor.
  • 5/19: [GSoC] Students begin coding.
  • 5/19 - 6/9 (three weeks): Submit patches for replacing the lockfile.c implementation with tempfile.h functions as an RFC. The goal of this project is to unify the implementation of temporary files, but at this point I will have simply moved the implementation of lockfile.c elsewhere; that is, no unification will have occurred yet. Therefore, the patches submitted at this point are not meant to be merged, but rather will be used to gather feedback from the community on the interface and implementation of tempfile.h.
  • 6/10 - 6/23 (two weeks): Submit patches for replacing diff.prepare_temp_file with tempfile.h functions. These patches represent a tangible benefit for the project: two distinct implementations of temporary files will have been unified. My personal goal is to have these patches merged into pu, although depending on feedback and the position of the project maintainer, this might not be feasible.
  • 6/24 - 6/28: [GSoC] Mentors and students submit mid-term evaluations.
  • 6/29 - 7/19 (three weeks): Submit patches for replacing git-compat-util.odb_mkstemp with tempfile.h functions.
  • 7/20 - 8/10 (three weeks): Submit patches for replacing git-compat-util.odb_pack_keep with tempfile.h functions.
  • 8/11: [GSoC] Suggested 'pencils down' date.
  • 8/12 - 8/18 (one week): Investigate and report on any remaining parts of the codebase that do not yet use the unified tempfile API.
  • 8/19: [GSoC] Firm 'pencils down' date.
  • 8/19 - 8/23: [GSoC] Mentors and students submit final evaluations.

Each of the above estimates includes the time necessary to refactor any related areas of the codebase, add tests as necessary, and respond to feedback provided via the mailing list.

Open Source Development Experience

  • Core member of Kiwi, an Objective-C behavior-driven development framework, since 2013. Authored commits here.
  • Submitted small patches to Git and libgit2.
  • Author of several open-source libraries in Objective-C, Python, and Ruby, available on GitHub.

Work/Internship Experience

  • Software Engineer at GREE, Inc. (Tokyo, Japan), 2011 - 2012. Developed mobile and web applications in Objective-C/PHP/JavaScript. Used SVN and Git for version control.
  • Senior Software Engineer at ShopKeep POS (New York, USA), 2012 - 2013. Developed mobile and web applications in Objective-C/Ruby/JavaScript. Used Git for version control.

Academic Experience

I completed a B.A. in Japanese Language and Literature in 2008, and am now a research student with a concentration in parallel and distributed computing at the University of Tokyo. I've only just begun my education, so I'm still in the process of finding a concrete research topic.

Why Me

In an email from 2013, the Git organization administrator for GSoC, Shawn Pearce, wrote:

Git has been involved since 2007. In all of that time we have had very few student projects merge successfully into their upstream project...before the end of GSoC. Even fewer students have stuck around and remained active contributors.

I have several years of experience working on distributed teams and open source projects. I'm confident I can use that experience to ensure that I set realistic milestones that result in code getting merged into Git.

I'm hoping to use GSoC as an opportunity to begin making contributions to Git on a regular basis. The ideas page for this year's GSoC highlights the fact that there are many ways to get involved, and I'd love to do so.

Why Git

I've used Git for many years. Contributing to the project itself has three benefits:

  1. I can contribute to the development of a tool I enjoy using.
  2. I can brag to my friends and colleagues, most of whom already use Git, that they're running code I helped write.
  3. Working on the implementation gives me a better understanding of how Git works, and thus makes me more comfortable using it from day to day.

[1] When writing pack files, Git writes to a temporary file first. Once this operation finishes successfully, the temporary file is atomically moved into place.
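The write-then-rename pattern in [1] can be sketched as follows. This is an illustration of the general technique under POSIX, not the code Git uses for pack files: the function name write_file_atomically and the ".tmp-XXXXXX" suffix are hypothetical. It relies on rename(2) replacing the destination atomically when both paths are on the same filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of buf to dest so that readers
 * see either the old contents of dest or the complete new contents,
 * never a partially written file. Returns 0 on success, -1 on failure. */
static int write_file_atomically(const char *dest, const void *buf, size_t len)
{
	char tmp[4096];
	const char *p = buf;
	int fd, n;

	assert(dest != NULL);

	/* Create the temporary file next to dest so that rename(2)
	 * stays within a single filesystem. */
	n = snprintf(tmp, sizeof(tmp), "%s.tmp-XXXXXX", dest);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	fd = mkstemp(tmp);
	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t written = write(fd, p, len);
		if (written <= 0)
			goto fail; /* sketch: no retry on EINTR */
		p += written;
		len -= (size_t)written;
	}

	/* Flush the data to disk before the file becomes visible as dest. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		fd = -1;
		goto fail;
	}
	fd = -1;

	/* Atomically move the finished file into place. */
	if (rename(tmp, dest) < 0)
		goto fail;
	return 0;

fail:
	/* This is the step that is easy to forget: on any error the
	 * temporary file must be removed, or it is left behind on disk. */
	if (fd >= 0)
		close(fd);
	unlink(tmp);
	return -1;
}
```

The fail path is where temporary files tend to leak in practice. If the program halts between mkstemp and rename without reaching that path, the temporary file stays on disk unless something like an exit handler removes it.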

[2] For example, when displaying a diff using an external tool by running the git diff --ext-diff command, Git creates two temporary files and passes them to the tool. After the tool exits, Git deletes these temporary files.

[3] The function is used in lockfile.hold_locked_index, which in turn is used in many places, including cache-tree.write_cache_as_tree, merge-recursive.merge_recursive_generic, merge.checkout_fast_forward, and so on.

[4] Used by fast-import.start_packfile, pack-write.write_idx_file, pack-write.create_tmp_packfile, and index-pack.open_pack_file. Each consumer is responsible for removing the file at some point during its execution. For example, pack-objects.write_pack_file creates a temporary file using create_tmp_packfile, and later renames this file to the appropriate destination once writing is complete (although there are plenty of opportunities for the program to halt prior to that point).

[5] Used by fast-import.keep_pack and index-pack.final.

[6] Used only internally by diff.run_external_diff and diff.run_textconv.
