It is clear that we need some sort of system in place to avoid fetching objects from the remote that we do not need. This is partially solved by using shallow clones to avoid fetching history when we are only working on the tip of a branch, but even here we are still fetching a significant number of files that are not necessarily being used by the editor.
To resolve this, we exploit the functionality behind partial clones, namely the `filter` option to the `fetch-pack` wire command, and promisor packfiles. We then use this to implement a sort of 'partial fetch' which, when combined with a pattern-based sparse checkout, has the effect of fetching only the objects necessary to render the files we are using, significantly speeding up the cloning process.
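The selection step of such a 'partial fetch' might look roughly like the sketch below. It assumes the tree listing is already available locally (for example after a clone filtered with `blob:none`). The names `TreeEntry`, `matchesPattern`, and `selectBlobsToFetch` are hypothetical, and the pattern matching is deliberately simplified rather than Git's full sparse-checkout syntax.

```typescript
// Hypothetical flattened tree entry: a file path and the id of its blob.
interface TreeEntry {
  path: string;
  oid: string;
}

// Simplified pattern matching (an assumption, not Git's full syntax):
//   "dir/"   matches everything under that directory
//   "*.ext"  matches any path ending in ".ext"
//   anything else must equal the path exactly
function matchesPattern(path: string, pattern: string): boolean {
  if (pattern.charAt(pattern.length - 1) === "/") {
    return path.indexOf(pattern) === 0;
  }
  if (pattern.indexOf("*.") === 0) {
    const suffix = pattern.slice(1);
    return path.slice(-suffix.length) === suffix;
  }
  return path === pattern;
}

// Return the object ids to request from the remote: blobs selected by the
// sparse patterns that are not yet present in the local object store.
function selectBlobsToFetch(
  entries: TreeEntry[],
  patterns: string[],
  localOids: string[]
): string[] {
  const wanted: string[] = [];
  for (const entry of entries) {
    const selected = patterns.some((p) => matchesPattern(entry.path, p));
    const missing = localOids.indexOf(entry.oid) === -1;
    if (selected && missing && wanted.indexOf(entry.oid) === -1) {
      wanted.push(entry.oid);
    }
  }
  return wanted;
}
```

Only the returned ids would then be requested from the remote, so blobs outside the sparse patterns are never downloaded.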
To make this easier for the end user, we can expose this functionality as a lazy loading filesystem which functions exactly the same as a normal filesystem, except in its implementation of `readFile()`.
The only modification necessary to `readFile()` is a hook at the very beginning which calls `checkout` on the current file before it is opened. Once promisor packfiles are implemented, this will handle the fetching and unpacking of the corresponding object in a way that is completely transparent to the end user.
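A minimal sketch of the `readFile()` hook, assuming a promise-based filesystem interface. `MinimalFs`, `CheckoutFile`, and `createLazyFs` are hypothetical names, and the per-path cache is an added assumption to avoid repeated checkouts, not something the design above requires.

```typescript
// Minimal filesystem surface for this sketch (hypothetical, promise-based).
interface MinimalFs {
  readFile(path: string): Promise<string>;
}

// Hypothetical callback that materializes one file in the working tree,
// fetching the missing object from the remote if necessary.
type CheckoutFile = (path: string) => Promise<void>;

// Wrap an existing filesystem so that readFile() checks the file out first.
function createLazyFs(inner: MinimalFs, checkoutFile: CheckoutFile): MinimalFs {
  // Paths that have already been checked out (assumed optimization).
  const materialized: Record<string, boolean> = {};
  return {
    async readFile(path: string): Promise<string> {
      // The hook: make sure the file exists locally before it is opened.
      if (!materialized[path]) {
        await checkoutFile(path);
        materialized[path] = true;
      }
      return inner.readFile(path);
    },
  };
}
```

Callers keep using `readFile()` as before; whether the content was already on disk or had to be fetched is invisible to them.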
When working with large teams, it is often not optimal or even secure for all members to have access to the entire repository. For this reason we implement file based authentication on the server side to prevent users from accessing files for which they have not been granted permission.
The simplest way to implement this while maintaining compatibility with Git is to have a modified version of the Git server which allows users the option to authenticate. Each object stored on the server is access controlled, and the server will refuse to serve objects to users who do not have permission, i.e. those objects will not be included in the packfiles sent to the user.
This approach has two issues. First, users who are not aware of this system will receive what looks like a corrupt repo when they attempt to clone a repo in which they do not have permission to access all of the objects. In order for this to work properly, it is necessary for the user to run a partial clone of only the files they have access to (or use the lazy fs).
Second, because tree objects contain the name, mode and hash of the files they reference, these attributes will be potentially visible to users even if they don't have permission to access the file (if they have access to the parent tree).
If this approach is taken, these issues should be made clear in the documentation.
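The server-side check could be sketched as follows. The `StoredObject` shape, the `"*"` convention for publicly readable objects, and the function names are all assumptions made for illustration, not part of any existing Git server.

```typescript
// Hypothetical server-side record: an object id plus who may read it.
interface StoredObject {
  oid: string;
  allowedUsers: string[]; // "*" marks an object readable by anyone (assumed convention)
}

// An unauthenticated request is represented by user === null.
function canRead(object: StoredObject, user: string | null): boolean {
  if (object.allowedUsers.indexOf("*") !== -1) return true;
  return user !== null && object.allowedUsers.indexOf(user) !== -1;
}

// Split the requested objects into those that go into the packfile and
// those that are silently left out for this user.
function selectObjectsForPack(
  requested: StoredObject[],
  user: string | null
): { served: string[]; withheld: string[] } {
  const served: string[] = [];
  const withheld: string[] = [];
  for (const object of requested) {
    (canRead(object, user) ? served : withheld).push(object.oid);
  }
  return { served, withheld };
}
```

A client that asked for an object in `withheld` would see it as missing, which is why an unfiltered clone looks corrupt to a user without full access.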
The Git workflow is very well suited to collaboration, but is limited by the fact that its most powerful tools only work with text files.
At its core, Git is simply a hash-based key-value store. Git has notions of hierarchical structure in the form of 'trees' and 'blobs', but it is only because of our familiarity with POSIX-style file systems that we associate these with 'folders' and 'files' respectively. A 'tree' is simply an object that is parent to one or more blobs or other trees.
Furthermore, the 'lines' of a file are only relevant to Git as far as generating diffs to present to the user, and the compression of packfiles. These 'lines' simply represent the increments in which a blob is modified.
Take a SQL database for example. MySQL handles multiple databases, each of which contains multiple tables. These tables then contain rows, which are the increments in which they are written to. Then for MySQL, our 'trees' are databases, our 'blobs' are tables, and our 'lines' are rows.
All that is necessary for Git to manage these files is a hook when staging, to convert the file to multiple Git objects, and a hook when checking out, to convert multiple Git objects back to their corresponding binary format. Similar hooks exist in the form of the `clean` and `smudge` hooks, however these have files as both their inputs and outputs. Our implementation would be more powerful in that it would allow for 'cleaning' a file into multiple Git objects, and 'smudging' multiple Git objects into a single file.
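As a rough illustration of a multi-object 'clean' and 'smudge', the sketch below splits a table-like file into one blob per row plus a tree-like object listing them, and reassembles it on checkout. Everything here is hypothetical: `toyHash` is a small non-cryptographic stand-in for Git's object hashing, and the in-memory `ObjectStore` stands in for the real object database.

```typescript
// In-memory stand-in for the object database: object id -> content.
type ObjectStore = Record<string, string>;

// Toy 32-bit FNV-1a hash, used only so the sketch is content-addressed.
// Git itself uses a cryptographic hash; this is NOT a replacement for it.
function toyHash(content: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    h ^= content.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("0000000" + (h >>> 0).toString(16)).slice(-8);
}

function putObject(store: ObjectStore, content: string): string {
  const oid = toyHash(content);
  store[oid] = content;
  return oid;
}

// 'Clean': one file in, many objects out. Each row becomes its own blob and
// a tree-like object records the row ids in order.
function cleanTable(store: ObjectStore, fileContent: string): string {
  const rowOids = fileContent.split("\n").map((row) => putObject(store, row));
  return putObject(store, JSON.stringify(rowOids));
}

// 'Smudge': many objects in, one file out.
function smudgeTable(store: ObjectStore, treeOid: string): string {
  const rowOids: string[] = JSON.parse(store[treeOid]);
  return rowOids.map((oid) => store[oid]).join("\n");
}
```

Because rows are stored individually, editing one row leaves the objects for every other row untouched, which is what would make row-level diffs and deduplication possible.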
Finally, in order for the diffs between these binary files to be presentable to the user, we will need to allow the end user to define their own 'diff' implementation to support various file types.
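One possible shape for user-defined diff implementations is a registry keyed by file extension, sketched below. `DiffProvider`, `registerDiffProvider`, and `diffFile` are hypothetical names. The fallback is a simplified comparison rather than a real line diff, and the JSON provider is only an example of a format-aware diff.

```typescript
// A diff provider turns two versions of a file into human-readable changes.
type DiffProvider = (before: string, after: string) => string[];

const diffProviders: Record<string, DiffProvider> = {};

function registerDiffProvider(extension: string, provider: DiffProvider): void {
  diffProviders[extension] = provider;
}

// Fallback (simplified): report lines that disappeared or appeared.
// Unlike a real line diff, this ignores ordering and duplicate lines.
function lineDiff(before: string, after: string): string[] {
  const oldLines = before.split("\n");
  const newLines = after.split("\n");
  const removed = oldLines.filter((l) => newLines.indexOf(l) === -1).map((l) => "- " + l);
  const added = newLines.filter((l) => oldLines.indexOf(l) === -1).map((l) => "+ " + l);
  return removed.concat(added);
}

// Pick the provider registered for the file's extension, if any.
function diffFile(path: string, before: string, after: string): string[] {
  const dot = path.lastIndexOf(".");
  const extension = dot === -1 ? "" : path.slice(dot);
  const provider = diffProviders[extension] || lineDiff;
  return provider(before, after);
}

// Example provider: compare the top-level keys of two JSON documents.
registerDiffProvider(".json", (before, after) => {
  const a: Record<string, unknown> = JSON.parse(before);
  const b: Record<string, unknown> = JSON.parse(after);
  const changes: string[] = [];
  for (const key of Object.keys(a)) {
    if (!(key in b)) changes.push("- " + key);
    else if (JSON.stringify(a[key]) !== JSON.stringify(b[key])) changes.push("~ " + key);
  }
  for (const key of Object.keys(b)) {
    if (!(key in a)) changes.push("+ " + key);
  }
  return changes;
});
```

A provider for a binary format would follow the same signature, decoding both versions before comparing them.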
There has been some discussion on whether we should continue using isomorphic-git as our backend, or switch to a more performant WASM based backend such as libgit2. I propose we continue using isomorphic-git as our backend. My reasoning is as follows:
- Portability, isomorphic-git runs everywhere JS runs. This would allow our git-sdk to run in environments that do not support WASM, such as React Native.
- Extensibility, being written in JS allows us to quickly extend isomorphic-git to add features and functionality as necessary. libgit2 on the other hand is written in C, and so even if we were to use language bindings we would need to work with the C codebase to extend the core git functionality (such as promisor packfiles).
- Filesystem support, isomorphic-git provides a flexible plugin architecture for filesystem backends which can be accessed with a standard API. With libgit2 we would need to write our own wrapper on top of an existing WASM filesystem solution, and would be limited to the filesystem types it supports.
- Ease of development, as mentioned earlier isomorphic-git is written in JS, which means it is much easier to integrate with our existing JS codebase. Furthermore, the isomorphic-git codebase is far more modular than libgit2's, and being written in JS means we are able to easily extend and modify any piece of its code to work with our application, not just those that are exposed by the API.
- We are already using isomorphic-git. This is no small point: switching from isomorphic-git to libgit2 would mean considerable downtime as we build an API on top of the new system. This is time taken away from development of new features in git-sdk, as well as the Inlang editor as a whole.
To summarize, I propose git-sdk should implement the following features:
- Standard Git commands for interacting with repositories
- Lazy loading based on `fetch-pack` filtering and promisor packfiles
- Lightweight file-based authentication built on top of the existing Git protocol
- More powerful implementations of the `smudge` and `clean` hooks, as well as diff providers to support version controlling diverse filetypes
While I propose we continue to use isomorphic-git when implementing this sdk, most of its features will be built on top of existing Git functionality, so our roadmap will look similar no matter which backend we go with.
For isomorphic-git, the roadmap will look something like the following (note the difficulty assessments in square brackets):
- Update isomorphic-git to support wire protocol v2 [medium]
- Implement support for promisor packfiles in isomorphic-git [medium-hard (?)]
- Implement partial cloning with a `filter` option to `git.clone` [easy]
- Abstract partial cloning into a lazy fs that is transparent to the user (git-sdk is born) [easy]
- Implement `smudge`, `clean`, and `diff` providers to support binary files [easy]
- Create a custom server implementation for file based authentication [hard]
If using libgit2, the roadmap would look somewhat similar:
- Implement support for promisor packfiles in libgit2 [hard]
- Implement partial cloning with a `filter` option to `git.clone` [medium (?)]
- Abstract partial cloning into a lazy fs that is transparent to the user (git-sdk is born) [easy]
- Write `smudge` and `clean` implementations for libgit2 which support multiple object input and output [hard (?)]
  - This could be made easier if we handle this in git-sdk in a way that is transparent to libgit2, but we lose the performance benefits
- Create a custom server implementation for file based authentication [hard]
Note that [easy, medium, hard] denote the amount of work involved, not necessarily the difficulty of the problem. Each rating assumes that the previous tasks have already been completed. These difficulty assessments are my own, based solely on my experience working with the respective codebases.
The largest issue faced by the Inlang editor in its current iteration using isomorphic-git is the time taken to clone large repositories. The cause of this is twofold:
- isomorphic-git is significantly slower in indexing packfiles sent from the remote than canonical git.
- the implementation of `checkout` in isomorphic-git suffers from a lack of optimization in determining which files to update. Where canonical git evaluates the files in a single pass, isomorphic-git evaluates them in multiple passes, causing considerable slowdown (see this comment and the comments in src/commands/checkout.js).
With considerable effort, this could potentially be improved, but at the moment it makes more sense to focus on optimizing the usage of our Git backend (i.e. partial clones etc.) rather than the performance of the backend itself.
Git-sdk should be designed to support both major Git workflows: patch/send-email (Linux, git, sourcehut), as well as fork/PR (GitHub, GitLab, Bitbucket). For this reason our sdk should also include functionality to generate and apply patches, which can then be used in these workflows.
In my opinion this is best done on the frontend with something like Operational Transform rather than Git, for performance reasons. Once a file's edits have been resolved, it can be committed normally to the repo (potentially noting the multiple contributors).
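To give a flavour of what Operational Transform involves, here is a minimal sketch covering only concurrent text insertions. The `Insert` shape, the `site` tie-breaker, and both functions are hypothetical; a real implementation would also have to handle deletions and more than two participants.

```typescript
// A single text insertion made by one participant (identified by `site`).
interface Insert {
  pos: number;
  text: string;
  site: number;
}

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so that it can be applied after `other` has already been
// applied. Ties at the same position are broken by site id so that both
// participants end up with the same ordering.
function transformInsert(op: Insert, other: Insert): Insert {
  const otherGoesFirst =
    other.pos < op.pos || (other.pos === op.pos && other.site < op.site);
  if (otherGoesFirst) {
    return { pos: op.pos + other.text.length, text: op.text, site: op.site };
  }
  return op;
}
```

Both participants apply their own edit first and then the transformed remote edit, and end up with the same text, which can then be committed as usual.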