araknast/git-sdk-proposal.md

A proposal for the Inlang git-sdk

Proposed Features

Lazy loading

Motivation

It is clear that we need some system in place to avoid fetching objects
from the remote that we do not need. This is partially solved by using shallow
clones to avoid fetching history when we are only working on the tip of a
branch, but even then we still fetch a significant number of files that
are not necessarily used by the editor.

Implementation

To resolve this, we exploit the functionality behind partial clones, namely the
filter option to the fetch-pack wire command, and promisor packfiles. We
then use these to implement a sort of 'partial fetch' which, when combined with
a pattern-based sparse checkout, fetches only the objects necessary to render
the files we are using and significantly speeds up the cloning process.
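The selection step can be sketched roughly as follows: given a flattened tree listing and a set of sparse-checkout patterns, work out which blobs still have to be requested from the remote. Everything here is an assumption made for illustration: the `TreeEntry` shape, the function names, and the simplified glob syntax (`*` within a path segment, `**` across segments) are hypothetical and are neither isomorphic-git APIs nor Git's full sparse-checkout pattern syntax.

```typescript
// Hypothetical flattened view of a commit's tree: one entry per file.
interface TreeEntry {
  path: string;
  oid: string;
}

// Simplified glob: `*` matches within one path segment, `**` matches across
// segments. This is an assumption for the sketch, not gitignore syntax.
function globToRegExp(glob: string): RegExp {
  const source = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000") // protect `**` before handling `*`
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp("^" + source + "$");
}

// Return the blob ids that match the sparse patterns and are not yet local.
function blobsToFetch(
  entries: TreeEntry[],
  patterns: string[],
  alreadyHave: Set<string>,
): string[] {
  const matchers = patterns.map((p) => globToRegExp(p));
  const wanted = new Set<string>();
  for (const entry of entries) {
    const selected = matchers.some((m) => m.test(entry.path));
    if (selected && !alreadyHave.has(entry.oid)) {
      wanted.add(entry.oid);
    }
  }
  return Array.from(wanted);
}
```

With patterns such as `messages/*.json`, only the translation files would be requested; everything else stays on the remote until it is needed.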
Abstraction

To make this easier for the end user, we can expose this functionality as a
lazy loading filesystem which functions exactly the same as a normal
filesystem, except in its implementation of readFile().

The only modification necessary to readFile() is a hook at the very beginning
which calls checkout on the current file before it is opened. Once promisor
packfiles are implemented, this will handle the fetching and unpacking of the
corresponding object in a way that is completely transparent to the end user.
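A minimal sketch of such a wrapper is shown below. It assumes a promise-based filesystem with a `readFile` method and a hypothetical `ensureCheckedOut` hook standing in for the checkout call; both names and the in-memory "remote" are illustrative only, and a real implementation would pass every other filesystem method through unchanged.

```typescript
// Minimal subset of a promise-based filesystem (hypothetical shape).
interface ReadableFs {
  readFile(path: string): Promise<string>;
}

// Hypothetical hook: makes sure `path` exists in the working tree, fetching
// and unpacking the backing object first if it is missing.
type EnsureCheckedOut = (path: string) => Promise<void>;

function makeLazyFs(base: ReadableFs, ensureCheckedOut: EnsureCheckedOut): ReadableFs {
  return {
    async readFile(path: string): Promise<string> {
      await ensureCheckedOut(path); // the only change: the hook runs first
      return base.readFile(path);
    },
  };
}

// Example wiring with an in-memory working tree and a fake remote.
const fakeRemote = new Map<string, string>([["messages/en.json", '{"hello":"Hello"}']]);
const workTree = new Map<string, string>();
const fetchedPaths: string[] = [];

const lazyFs = makeLazyFs(
  {
    async readFile(path) {
      const data = workTree.get(path);
      if (data === undefined) throw new Error("ENOENT: " + path);
      return data;
    },
  },
  async (path) => {
    if (workTree.has(path)) return; // already materialized locally
    const data = fakeRemote.get(path);
    if (data === undefined) throw new Error("not in repository: " + path);
    fetchedPaths.push(path); // stands in for fetch + unpack
    workTree.set(path, data);
  },
);
```

Callers simply `await lazyFs.readFile(path)`; the first read of a path triggers the fetch, and later reads are served from the working tree.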
File Based Authentication

Motivation

When working with large teams, it is often not optimal or even secure for all
members to have access to the entire repository. For this reason we implement
file based authentication on the server side to prevent users from accessing
files for which they have not been granted permission.

Implementation

The simplest way to implement this while maintaining compatibility with Git is
to have a modified version of the Git server which gives users the option to
authenticate. Each object stored on the server is access controlled, and the
server will refuse to serve objects to users who do not have permission, i.e.
those objects will not be included in the packfiles sent to the user.
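The server-side filtering step might look roughly like this. The sketch assumes a hypothetical access control list keyed by object id and treats objects without an entry as public; neither detail is specified above, and the type and function names are made up for illustration.

```typescript
type Oid = string;

interface GitObject {
  oid: Oid;
  type: "blob" | "tree" | "commit";
}

// Hypothetical ACL: object id -> users allowed to read that object.
type Acl = Map<Oid, Set<string>>;

// Decide which of the requested objects may go into the packfile.
// `user` is null for clients that chose not to authenticate.
function selectObjectsForPack(
  requested: GitObject[],
  acl: Acl,
  user: string | null,
): GitObject[] {
  return requested.filter((obj) => {
    const readers = acl.get(obj.oid);
    // Assumption: objects with no ACL entry are readable by everyone.
    if (readers === undefined) return true;
    return user !== null && readers.has(user);
  });
}
```

Because restricted objects are silently dropped from the pack, the client ends up with references to objects it never received, which leads to the issues described next.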
Issues

First, users who are not aware of this system will receive what looks like a
corrupt repo when they attempt to clone a repo for which they do not have
permission to access all the objects. In order for this to work properly it is
necessary for the user to run a partial clone of only the files they have
access to (or use the lazy fs).

Second, because tree objects contain the name, mode and hash of
the files they reference, these attributes will be potentially visible to users
even if they don't have permission to access the file (if they have access to
the parent tree).

If this approach is taken, these issues should be made clear in the
documentation.

Support for Different File Types

Motivation

The Git workflow is very well suited to collaboration, but is limited by the
fact that its most powerful tools only work with text files.

At its core, Git is simply a hash-based key-value store. Git has notions of
hierarchical structure in the form of 'trees' and 'blobs', but it is only
because of our familiarity with POSIX-style file systems that we associate
these with 'folders' and 'files' respectively. A 'tree' is simply an object
that is parent to one or more blobs or other trees.

Furthermore, the 'lines' of a file are only relevant to Git as far as generating
diffs to present to the user, and the compression of packfiles. These 'lines'
simply represent the increments in which a blob is modified.

Take a SQL database for example. MySQL handles multiple databases, each of
which contains multiple tables. These tables then contain rows, which are the
increments in which they are written to. Then for MySQL, our 'trees' are
databases, our 'blobs' are tables, and our 'lines' are rows.

Implementation

All that is necessary for Git to manage these files is a hook when staging, to
convert the file into multiple Git objects, and a hook when checking out, to
convert multiple Git objects back into the corresponding binary format. Similar
hooks exist in the form of the clean and smudge filters; however, these take a
single file as input and produce a single file as output. Our implementation
would be more powerful in that it would allow 'cleaning' a file into multiple
Git objects, and 'smudging' multiple Git objects into a single file.
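To make the idea concrete, here is a small sketch of a multi-object clean and smudge pair for the database analogy: each table becomes its own blob with one row per line, and a tree-like manifest records which blob belongs to which table. The `ObjectStore` class, the JSON manifest, and the FNV-1a hash (a stand-in for Git's real object hashing) are all assumptions made for this example; rows are assumed to be non-empty and free of newlines.

```typescript
// Stand-in for Git's object hashing; NOT what Git actually uses.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

// Hypothetical content-addressed store standing in for the Git object database.
class ObjectStore {
  private objects = new Map<string, string>();
  put(content: string): string {
    const id = fnv1a(content);
    this.objects.set(id, content);
    return id;
  }
  get(id: string): string {
    const content = this.objects.get(id);
    if (content === undefined) throw new Error("missing object " + id);
    return content;
  }
}

// Table name -> rows.
type Database = Record<string, string[]>;

// "Clean": one database becomes one blob per table plus a manifest ("tree").
function cleanDatabase(store: ObjectStore, db: Database): string {
  const tree = Object.keys(db)
    .sort()
    .map((table) => ({ table, blob: store.put(db[table].join("\n")) }));
  return store.put(JSON.stringify(tree));
}

// "Smudge": follow the manifest and rebuild the database from its blobs.
function smudgeDatabase(store: ObjectStore, treeId: string): Database {
  const tree: { table: string; blob: string }[] = JSON.parse(store.get(treeId));
  const db: Database = {};
  for (const entry of tree) {
    const text = store.get(entry.blob);
    db[entry.table] = text === "" ? [] : text.split("\n");
  }
  return db;
}
```

One consequence of this layout is that editing a single table only produces a new blob for that table; the blobs for untouched tables keep their ids.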
Finally, in order for the diffs between these binary files to be presentable to
the user, we will need to allow the end user to define their own 'diff'
implementation for each file type they want to support.
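One possible shape for such user-defined diff providers is sketched below. The `DiffProvider` interface, the extension-based lookup, and the row-level CSV provider are hypothetical and only illustrate the idea; the example diff is set-based, so it ignores row order and duplicate rows.

```typescript
// Hypothetical plug-in interface for file-type-specific diffs.
interface DiffProvider {
  extensions: string[]; // file extensions this provider handles
  diff(before: string, after: string): string[];
}

// Example provider: treats each line as a row and reports removed and added
// rows. Set-based, so row order and duplicates are ignored.
const csvRowDiff: DiffProvider = {
  extensions: [".csv"],
  diff(before, after) {
    const oldRows = before === "" ? [] : before.split("\n");
    const newRows = after === "" ? [] : after.split("\n");
    const oldSet = new Set(oldRows);
    const newSet = new Set(newRows);
    const out: string[] = [];
    for (const row of oldRows) if (!newSet.has(row)) out.push("- " + row);
    for (const row of newRows) if (!oldSet.has(row)) out.push("+ " + row);
    return out;
  },
};

// Pick the first registered provider whose extension matches the path.
function diffWith(
  providers: DiffProvider[],
  path: string,
  before: string,
  after: string,
): string[] {
  const provider = providers.find((p) => p.extensions.some((ext) => path.endsWith(ext)));
  if (provider === undefined) throw new Error("no diff provider registered for " + path);
  return provider.diff(before, after);
}
```

A caller would register its providers once and then call `diffWith(providers, path, oldContent, newContent)` whenever a diff needs to be rendered.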
Building on top of isomorphic-git

There has been some discussion on whether we should continue using
isomorphic-git as our backend, or switch to a more performant WASM based backend
such as libgit2. I propose we continue using isomorphic-git as our backend. My
reasoning is as follows:

1. Portability: isomorphic-git runs everywhere JS runs. This would allow our
   git-sdk to run in environments that do not support WASM, such as React
   Native.
2. Extensibility: being written in JS allows us to quickly extend isomorphic-git
   to add features and functionality as necessary. libgit2, on the other hand,
   is written in C, so even if we were to use language bindings we would need
   to work with the C codebase to extend the core git functionality (such
   as promisor packfiles).
3. Filesystem support: isomorphic-git provides a flexible plugin architecture
   for filesystem backends which can be accessed with a standard API. With
   libgit2 we would need to write our own wrapper on top of an existing WASM
   filesystem solution, and would be limited to the filesystem types it
   supports.
4. Ease of development: as mentioned earlier, isomorphic-git is written in JS,
   which means it is much easier to integrate with our existing JS codebase.
   Furthermore, the isomorphic-git codebase is far more modular than
   libgit2's, and being written in JS means we are able to easily extend and
   modify any piece of its code to work with our application, not just
   those that are exposed by the API.
5. We are already using isomorphic-git. This is no small point: switching from
   isomorphic-git to libgit2 would mean considerable downtime as we build an
   API on top of the new system. This is time taken away from
   development of new features in the git-sdk, as well as the Inlang editor as
   a whole.

Summary and Roadmap

To summarize, I propose git-sdk should implement the following features:

- Standard Git commands for interacting with repositories
- Lazy loading based on fetch-pack filtering and promisor packfiles
- Lightweight file-based authentication built on top of the existing Git protocol
- More powerful implementations of the smudge and clean filters, as well as diff
  providers to support version controlling diverse filetypes

While I propose we continue to use isomorphic-git when implementing this sdk,
most of its features will be built on top of existing Git functionality, so our
roadmap will look similar no matter which backend we go with.
For isomorphic-git, the roadmap would look something like the following
(difficulty assessments are given in square brackets):

1. Update isomorphic-git to support wire protocol v2 [medium]
2. Implement support for promisor packfiles in isomorphic-git [medium-hard (?)]
3. Implement partial cloning with a filter option to git.clone [easy]
4. Abstract partial cloning into a lazy fs that is transparent to the user (git-sdk is born) [easy]
5. Implement smudge, clean, and diff providers to support binary files [easy]
6. Create a custom server implementation for file-based authentication [hard]
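For a sense of what the first and third steps involve on the wire, the sketch below builds the body of a protocol v2 `fetch` request that carries a `filter` argument, using Git's pkt-line framing (a four-digit hexadecimal length that includes the four length bytes, followed by the payload). The helper names are made up for this example, and it leaves out details a real client needs: capability lines such as `agent`, `have` negotiation, and a check that the server actually advertised the `filter` feature.

```typescript
const FLUSH_PKT = "0000"; // ends the request
const DELIM_PKT = "0001"; // separates the command section from its arguments

// Frame one payload as a pkt-line. Assumes an ASCII payload, so the string
// length equals the byte length.
function pktLine(payload: string): string {
  const length = payload.length + 4;
  return ("0000" + length.toString(16)).slice(-4) + payload;
}

// Build a minimal protocol v2 fetch request body (hypothetical helper).
function buildFetchRequest(wants: string[], filterSpec?: string): string {
  const parts = [pktLine("command=fetch\n"), DELIM_PKT];
  for (const oid of wants) {
    parts.push(pktLine("want " + oid + "\n"));
  }
  if (filterSpec !== undefined) {
    // e.g. "blob:none" asks the server to omit all blobs from the pack
    parts.push(pktLine("filter " + filterSpec + "\n"));
  }
  parts.push(pktLine("done\n"), FLUSH_PKT);
  return parts.join("");
}
```

Calling `buildFetchRequest([commitOid], "blob:none")` yields a request for a commit and its trees without any file contents, which is the starting point for fetching blobs lazily later on.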

If using libgit2, the roadmap would look somewhat similar:

1. Implement support for promisor packfiles in libgit2 [hard]
2. Implement partial cloning with a filter option to git.clone [medium (?)]
3. Abstract partial cloning into a lazy fs that is transparent to the user (git-sdk is born) [easy]
4. Write smudge and clean implementations for libgit2 which support
   multiple object input and output [hard (?)]
   - This could be made easier if we handle this in git-sdk in a way that is
     transparent to libgit2, but we would lose the performance benefits.
5. Create a custom server implementation for file-based authentication [hard]

Note that [easy, medium, hard] denote the amount of work involved, not
necessarily the difficulty of the problem. Each assessment assumes that the
previous tasks have already been completed, and the assessments are my own,
based solely on my experience working with the respective codebases.

Miscellaneous Notes

Performance Issues in Isomorphic-git

The largest issue faced by the Inlang editor in its current iteration using
isomorphic-git is the time taken to clone large repositories. The cause of this
is twofold:

1. isomorphic-git is significantly slower than canonical git at indexing
   packfiles sent from the remote.
2. The implementation of checkout in isomorphic-git suffers from a lack of
   optimization in determining which files to update. Where canonical git
   evaluates the files in a single pass, isomorphic-git evaluates them in
   multiple passes, causing considerable slowdown (see this comment
   and the comments in src/commands/checkout.js).

With considerable effort, this could potentially be improved, but at the
moment it makes more sense to focus on optimizing our usage of the Git backend
(i.e. partial clones etc.) rather than the performance of the backend itself.

Fork vs. Patch Workflow

Git-sdk should be designed to support both major Git workflows:
patch/send-email (Linux, git, sourcehut), as well as fork/PR (GitHub, GitLab,
Bitbucket). For this reason our sdk should also include functionality to
generate and apply patches, which can then be used in these workflows.
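As a very rough sketch of what "generate and apply" could mean at the lowest level, the example below produces a single hunk by trimming the common prefix and suffix of two line arrays and then applies that hunk to the original lines. The `Hunk` shape and function names are hypothetical, the result is not a minimal diff, and this is a deliberate simplification rather than the unified diff format that the patch/send-email workflow actually exchanges.

```typescript
// Hypothetical simplified hunk: replace `remove` with `add` at line `start`
// (0-based index into the old file).
interface Hunk {
  start: number;
  remove: string[];
  add: string[];
}

// Generate one hunk by trimming the common prefix and suffix.
// Returns null when the two versions are identical.
function makeHunk(oldLines: string[], newLines: string[]): Hunk | null {
  let prefix = 0;
  while (
    prefix < oldLines.length &&
    prefix < newLines.length &&
    oldLines[prefix] === newLines[prefix]
  ) {
    prefix++;
  }
  let suffix = 0;
  while (
    suffix < oldLines.length - prefix &&
    suffix < newLines.length - prefix &&
    oldLines[oldLines.length - 1 - suffix] === newLines[newLines.length - 1 - suffix]
  ) {
    suffix++;
  }
  const remove = oldLines.slice(prefix, oldLines.length - suffix);
  const add = newLines.slice(prefix, newLines.length - suffix);
  if (remove.length === 0 && add.length === 0) return null;
  return { start: prefix, remove, add };
}

// Apply a hunk, refusing to do so if the lines to remove do not match.
function applyHunk(lines: string[], hunk: Hunk): string[] {
  const end = hunk.start + hunk.remove.length;
  const target = lines.slice(hunk.start, end);
  const matches =
    target.length === hunk.remove.length && target.every((l, i) => l === hunk.remove[i]);
  if (!matches) throw new Error("patch does not apply: removed lines do not match");
  return lines.slice(0, hunk.start).concat(hunk.add, lines.slice(end));
}
```

The mismatch check in `applyHunk` mirrors the way a patch is rejected when the target file has drifted from the version the patch was generated against.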
Collaboration and Multi-User Editing

In my opinion this is best done on the frontend with something like Operational
Transform rather than with Git, for performance reasons. Once a file's edits
have been resolved, it can be committed normally to the repo (potentially
noting the multiple contributors).
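For readers unfamiliar with the technique, the core of Operational Transform can be sketched with insert-only operations: when two users edit concurrently, each incoming operation is shifted to account for the edit that was applied first, so both sides converge on the same text. The `InsertOp` shape and the tie-breaking by `site` id are assumptions for this example; a real editor would also need deletions and a way to order operations across more than two users.

```typescript
// Hypothetical insert-only operation; `site` identifies the author and is
// used to break ties when two inserts target the same position.
interface InsertOp {
  pos: number;
  text: string;
  site: number;
}

function applyOp(doc: string, op: InsertOp): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so that it can be applied after `applied` has already been
// applied to the same starting document.
function transform(op: InsertOp, applied: InsertOp): InsertOp {
  const appliedComesFirst =
    applied.pos < op.pos || (applied.pos === op.pos && applied.site < op.site);
  return appliedComesFirst ? { ...op, pos: op.pos + applied.text.length } : op;
}

// Both orders of application should yield the same document.
function converge(doc: string, a: InsertOp, b: InsertOp): [string, string] {
  const viaA = applyOp(applyOp(doc, a), transform(b, a));
  const viaB = applyOp(applyOp(doc, b), transform(a, b));
  return [viaA, viaB];
}
```

Once the frontend has converged on a single text this way, the result can be written to the working tree and committed like any other change.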