@araknast
Last active March 27, 2023 19:34

A proposal for the Inlang git-sdk

Proposed Features

Lazy loading

Motivation

It is clear that we need some system in place to avoid fetching objects from the remote that we do not need. Shallow clones partially solve this by skipping history when we only work on the tip of a branch, but even then we still fetch a significant number of files that the editor never uses.

Implementation

To resolve this, we exploit the functionality behind partial clones, namely the filter option to the fetch-pack wire command and promisor packfiles. We then use this to implement a sort of 'partial fetch' which, when combined with a pattern-based sparse checkout, fetches only the objects necessary to render the files we are using and significantly speeds up the cloning process.
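A hedged sketch of what such a filtered request could look like on the wire, assuming wire protocol v2 and a server that advertises the filter feature for its fetch command. The helper names pktLine and buildFilteredFetch are illustrative and not part of Git or isomorphic-git; capability lines and negotiation ('have' lines) are left out for brevity.

```typescript
// Encode one pkt-line: a 4-digit hex length (which counts the 4 length
// characters themselves) followed by the payload.
// Assumption: payloads are ASCII, so string length equals byte length.
function pktLine(payload: string): string {
  const hex = (payload.length + 4).toString(16);
  return ("0000" + hex).slice(-4) + payload;
}

const DELIM_PKT = "0001"; // separates the command section from its arguments
const FLUSH_PKT = "0000"; // marks the end of the request

// Build a protocol v2 fetch request that asks for the given commits
// while applying an object filter such as "blob:none".
function buildFilteredFetch(wants: string[], filter: string): string {
  const parts = [pktLine("command=fetch\n"), DELIM_PKT];
  for (const oid of wants) {
    parts.push(pktLine("want " + oid + "\n"));
  }
  parts.push(pktLine("filter " + filter + "\n"));
  parts.push(pktLine("done\n"));
  parts.push(FLUSH_PKT);
  return parts.join("");
}
```

With a filter of blob:none the server leaves file contents out of the pack, and the missing blobs can be requested later by object ID once a file is actually opened.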

Abstraction

To make this easier for the end user, we can expose this functionality as a lazy loading filesystem which functions exactly the same as a normal filesystem, except in its implementation of readFile().

The only modification necessary to readFile() is a hook at the very beginning which calls checkout on the current file before it is opened. Once promisor packfiles are implemented, this will handle the fetching and unpacking of the corresponding object in a way that is completely transparent to the end user.
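A minimal sketch of that hook, under stated assumptions: the ReadableFs interface and the ensureCheckedOut callback are hypothetical names, the remote and working tree are in-memory stand-ins, and the API is synchronous only to keep the example short (a real filesystem backend would be promise-based).

```typescript
// Minimal synchronous file-reading interface used by this sketch.
interface ReadableFs {
  readFile(path: string): string;
}

// Hypothetical hook: makes sure `path` is present in the working tree,
// fetching and unpacking the backing object if it is still missing.
type EnsureCheckedOut = (path: string) => void;

// Wrap `inner` so every read first runs the checkout hook; the read
// itself is delegated unchanged.
function lazyFs(inner: ReadableFs, ensureCheckedOut: EnsureCheckedOut): ReadableFs {
  return {
    readFile(path: string): string {
      ensureCheckedOut(path); // the only addition over a normal fs
      return inner.readFile(path);
    },
  };
}

// In-memory stand-ins for the remote and the working tree.
const remoteObjects: Record<string, string> = {
  "messages/en.json": '{"hello":"Hello"}',
};
const workingTree: Record<string, string> = {};
const fetchedPaths: string[] = [];

const plainFs: ReadableFs = {
  readFile(path: string): string {
    if (!(path in workingTree)) throw new Error("ENOENT: " + path);
    return workingTree[path];
  },
};

const lazy = lazyFs(plainFs, (path: string): void => {
  if (path in workingTree) return; // already materialised, nothing to fetch
  if (!(path in remoteObjects)) return; // unknown path: let readFile raise ENOENT
  workingTree[path] = remoteObjects[path]; // stands in for fetch + unpack
  fetchedPaths.push(path);
});
```

Reading the same path twice triggers only one simulated fetch, which is the behaviour the caller should observe from a lazy filesystem.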

File Based Authentication

Motivation

When working with large teams, it is often not optimal or even secure for all members to have access to the entire repository. For this reason we implement file based authentication on the server side to prevent users from accessing files for which they have not been granted permission.

Implementation

The simplest way to implement this while maintaining compatibility with Git is to have a modified version of the Git server which gives users the option to authenticate. Each object stored on the server is access controlled, and the server will refuse to serve objects to users who do not have permission, i.e. those objects will not be included in the packfiles sent to the user.
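The server-side check could be sketched roughly as follows. Everything here is an assumption made for illustration: the Acl shape, the prefix-based rule, and the fact that each object carries its path (a real server would have to resolve paths by walking trees, since Git objects do not store them).

```typescript
// An object the client asked for, annotated with the path it was
// reached through (assumption: the server has already resolved this).
interface RequestedObject {
  oid: string;
  path: string;
}

// Hypothetical access-control list: user name -> readable path prefixes.
type Acl = Record<string, string[]>;

function canRead(acl: Acl, user: string, path: string): boolean {
  const prefixes = acl[user] || [];
  return prefixes.some((prefix) => path.indexOf(prefix) === 0);
}

// Select the objects that may be written into the packfile for `user`;
// everything else is silently left out.
function objectsForUser(acl: Acl, user: string, requested: RequestedObject[]): RequestedObject[] {
  return requested.filter((obj) => canRead(acl, user, obj.path));
}
```

A user with no entry in the list receives an empty selection, so the default is to deny access.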

Issues

First, users who are not aware of this system will receive what looks like a corrupt repo when they attempt to clone a repo for which they do not have permission to access all the objects. In order for this to work properly it is necessary for the user to run a partial clone of only the files they have access to (or use the lazy fs).

Second, because tree objects contain the name, mode and hash of the files they reference, these attributes will be potentially visible to users even if they don't have permission to access the file (if they have access to the parent tree).

If this approach is taken, these issues should be made clear in the documentation.

Support for Different File Types

Motivation

The Git workflow is very well suited to collaboration, but is limited by the fact that its most powerful tools only work with text files.

At its core, Git is simply a hash-based key-value store. Git has notions of hierarchical structure in the form of 'trees' and 'blobs', but it is only because of our familiarity with POSIX-style file systems that we associate these with 'folders' and 'files' respectively. A 'tree' is simply an object that is parent to one or more blobs or other trees.

Furthermore the 'lines' of a file are only relevant to Git as far as generating diffs to present to the user, and the compression of packfiles. These 'lines' simply represent the increments in which a blob is modified.

Take a SQL database for example. MySQL handles multiple databases, each of which contains multiple tables. These tables then contain rows, which are the increments in which they are written to. Then for MySQL, our 'trees' are databases, our 'blobs' are tables, and our 'lines' are rows.

Implementation

All that is necessary for Git to manage these files is a hook when staging, to convert the file to multiple Git objects, and a hook when checking out, to convert multiple Git objects back to their corresponding binary format. Similar hooks exist in the form of the clean and smudge hooks, however these have files as both their inputs and outputs. Our implementation would be more powerful in that it would allow for 'cleaning' a file into multiple Git objects, and 'smudging' multiple Git objects into a single file.
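As a toy illustration of the idea, the sketch below 'cleans' one database value into one blob per table and 'smudges' those blobs back. The Database type, the tab-separated row format, and both function names are assumptions made for this example; they are not existing Git or isomorphic-git hooks.

```typescript
// A toy "database": table name -> rows, each row a list of cell values.
// Assumption: cells are non-empty and contain no tab or newline.
type Database = Record<string, string[][]>;

// Hypothetical multi-object clean: one file in, one blob per table out.
// Each row becomes one tab-separated line of its table's blob.
function cleanDatabase(db: Database): Record<string, string> {
  const blobs: Record<string, string> = {};
  for (const table of Object.keys(db).sort()) {
    blobs[table] = db[table].map((row) => row.join("\t")).join("\n");
  }
  return blobs;
}

// Hypothetical multi-object smudge: many blobs in, one file out.
function smudgeDatabase(blobs: Record<string, string>): Database {
  const db: Database = {};
  for (const table of Object.keys(blobs)) {
    const text = blobs[table];
    db[table] = text === "" ? [] : text.split("\n").map((line) => line.split("\t"));
  }
  return db;
}
```

Because every row is its own line, changing a single row changes a single line of a single blob, which is what lets line-oriented tooling work on the result.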

Finally, in order for the diffs between these binary files to be presentable to the user, we will need to allow the end user to define their own 'diff' implementation to support various file types.
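One possible shape for such user-defined diffs is a registry of providers keyed by file extension. This is only a sketch: the DiffProvider type, the '.table' extension, and the row-level diff (which treats each line as an unordered row) are all hypothetical.

```typescript
interface RowDiff {
  added: string[];
  removed: string[];
}

// A diff provider turns two versions of a file into a presentable diff.
type DiffProvider = (before: string, after: string) => RowDiff;

// Example provider: treats every line as a row and ignores row order.
const rowDiff: DiffProvider = (before, after) => {
  const oldRows = before === "" ? [] : before.split("\n");
  const newRows = after === "" ? [] : after.split("\n");
  return {
    added: newRows.filter((row) => oldRows.indexOf(row) === -1),
    removed: oldRows.filter((row) => newRows.indexOf(row) === -1),
  };
};

// Hypothetical registry: file extension -> provider supplied by the user.
const diffProviders: Record<string, DiffProvider> = { ".table": rowDiff };

// Look up the provider for a path, or undefined if none is registered.
function diffFor(path: string): DiffProvider | undefined {
  const dot = path.lastIndexOf(".");
  return dot === -1 ? undefined : diffProviders[path.slice(dot)];
}
```

A caller would fall back to the ordinary line diff whenever diffFor returns undefined.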

Building on top of isomorphic-git

There has been some discussion on whether we should continue using isomorphic-git as our backend, or switch to a more performant WASM based backend such as libgit2. I propose we continue using isomorphic-git as our backend. My reasoning is as follows:

  1. Portability: isomorphic-git runs everywhere JS runs. This would allow our git-sdk to run in environments that do not support WASM, such as React Native.
  2. Extensibility: being written in JS allows us to quickly extend isomorphic-git to add features and functionality as necessary. libgit2, on the other hand, is written in C, so even if we were to use language bindings we would need to work with the C codebase to extend the core Git functionality (such as promisor packfiles).
  3. Filesystem support: isomorphic-git provides a flexible plugin architecture for filesystem backends which can be accessed with a standard API. With libgit2 we would need to write our own wrapper on top of an existing WASM filesystem solution, and would be limited to the filesystem types it supports.
  4. Ease of development: as mentioned earlier, isomorphic-git is written in JS, which means it is much easier to integrate with our existing JS codebase. Furthermore, the isomorphic-git codebase is far more modular than libgit2's, and being written in JS means we are able to easily extend and modify any piece of its code to work with our application, not just those that are exposed by the API.
  5. We are already using isomorphic-git. This is no small point: switching from isomorphic-git to libgit2 would mean considerable downtime as we build an API on top of the new system. This is time taken away from development of new features in the git-sdk, as well as the Inlang editor as a whole.

Summary and Roadmap

To summarize, I propose git-sdk should implement the following features:

  1. Standard Git commands for interacting with repositories
  2. Lazy loading based on fetch-pack filtering and promisor packfiles
  3. Lightweight file-based authentication built on top of the existing Git protocol
  4. More powerful implementations of the smudge and clean hooks, as well as diff providers to support version controlling diverse file types

While I propose we continue to use isomorphic-git when implementing this sdk, most of its features will be built on top of existing Git functionality, so our roadmap will look similar no matter which backend we go with.

For isomorphic-git, the roadmap will look something like the following (note the difficulty assessments in square brackets):

  1. Update isomorphic-git to support wire protocol v2 [medium]
  2. Implement support for promisor packfiles in isomorphic-git [medium-hard (?)]
  3. Implement partial cloning with a filter option to git.clone [easy]
  4. Abstract partial cloning into a lazy fs that is transparent to the user (git-sdk is born) [easy]
  5. Implement smudge, clean, and diff providers to support binary files [easy]
  6. Create a custom server implementation for file based authentication [hard]

If using libgit2, the roadmap would look somewhat similar:

  1. Implement support for promisor packfiles in libgit2 [hard]
  2. Implement partial cloning with a filter option to git.clone [medium (?)]
  3. Abstract partial cloning into a lazy fs that is transparent to the user (git-sdk is born) [easy]
  4. Write smudge and clean implementations for libgit2 which support multiple object input and output [hard (?)]
    • This could be made easier if we handle this in git-sdk in a way that is transparent to libgit2, but we lose the performance benefits
  5. Create a custom server implementation for file based authentication [hard]

Note that [easy, medium, hard] denote the amount of work involved, not necessarily the difficulty of the problem. Each rating assumes that the previous tasks have already been completed. The difficulty assessments are my own, based solely on my experience working with the respective codebases.

Miscellaneous Notes

Performance Issues in Isomorphic-git

The largest issue faced by the Inlang editor in its current iteration using isomorphic-git is the time taken to clone large repositories. The cause of this is twofold:

  1. isomorphic-git is significantly slower in indexing packfiles sent from the remote than canonical git.
  2. the implementation of checkout in isomorphic-git suffers from a lack of optimization in determining which files to update. Where canonical git evaluates the files in a single pass, isomorphic-git evaluates them in multiple passes causing considerable slowdown (see this comment and the comments in src/commands/checkout.js).
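To illustrate what a single-pass evaluation means in point 2, the sketch below walks two path-sorted listings once and emits the required changes. It is a generic merge walk written for this note, not the algorithm used by canonical git or by isomorphic-git, and the Entry and Change types are assumptions.

```typescript
interface Entry {
  path: string;
  oid: string;
}

interface Change {
  path: string;
  kind: "create" | "update" | "delete";
}

// Compare the current working tree listing with the target tree in one
// pass. Both inputs must already be sorted by path.
function planCheckout(current: Entry[], target: Entry[]): Change[] {
  const changes: Change[] = [];
  let i = 0;
  let j = 0;
  while (i < current.length && j < target.length) {
    const cur = current[i];
    const tgt = target[j];
    if (cur.path < tgt.path) {
      changes.push({ path: cur.path, kind: "delete" }); // only in current
      i++;
    } else if (tgt.path < cur.path) {
      changes.push({ path: tgt.path, kind: "create" }); // only in target
      j++;
    } else {
      if (cur.oid !== tgt.oid) changes.push({ path: cur.path, kind: "update" });
      i++;
      j++;
    }
  }
  // Whatever remains exists on one side only.
  for (; i < current.length; i++) changes.push({ path: current[i].path, kind: "delete" });
  for (; j < target.length; j++) changes.push({ path: target[j].path, kind: "create" });
  return changes;
}
```

Each entry is visited exactly once, so the cost grows linearly with the number of files rather than with the number of passes.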

With considerable effort, this could potentially be improved, but at the moment it makes more sense to focus on optimizing the usage of our Git backend (i.e. partial clones etc.) rather than the performance of the backend itself.

Fork vs. Patch Workflow

Git-sdk should be designed to support both major Git workflows: patch/send-email (Linux, git, sourcehut), as well as fork/PR (GitHub, GitLab, Bitbucket). For this reason our sdk should also include functionality to generate and apply patches, which can then be used in these workflows.

Collaboration and Multi-User Editing

In my opinion this is best done on the frontend with something like Operational Transform rather than with Git, for performance reasons. Once a file's edits have been resolved, it can be committed normally to the repo (potentially noting the multiple contributors).
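For readers unfamiliar with Operational Transform, here is a minimal sketch covering only concurrent text insertions. The Insert type and the tie-breaking rule by site ID are assumptions of this example; a production system would also need deletions and a way to order operations.

```typescript
interface Insert {
  pos: number;  // index in the document where text is inserted
  text: string;
  site: number; // ID of the editing client, used only to break ties
}

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so that it can be applied after the concurrent operation
// `other` has already been applied to the same starting document.
function transform(op: Insert, other: Insert): Insert {
  const otherGoesFirst =
    other.pos < op.pos || (other.pos === op.pos && other.site < op.site);
  if (!otherGoesFirst) return op;
  return { pos: op.pos + other.text.length, text: op.text, site: op.site };
}
```

Both clients end up with the same text regardless of the order in which they receive the two operations, and only that final text needs to be committed.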
