fjahr/bitcoin_core_gitlab.md Secret

Bitcoin Core backups on self-hosted GitLab

Introduction

This document shows how a self-hosted GitLab server can be used as a backup
for the Bitcoin Core repo on GitHub.com. Once an import has finished, the
backup can be interacted with on the GitLab instance. A self-hosted GitLab
instance could also be used for further development if there were issues with
the repo on GitHub.com.

Running a backup server

System Requirements

GitLab lists the minimum recommended requirements for a self-hosted instance
here. The TL;DR for up
to 500 users: 4 cores and 4 GB RAM. For storage we recommend at least 20 GB:
GitLab alone recommends 10 GB, and system and repo requirements come on top of that.
Beyond the requirements for collaboration, a somewhat beefier machine also
speeds up the import. On a machine with double the specs described above we
have seen import times of ~36 hours.
GitLab supports a list of operating systems;
Ubuntu and CentOS seem to be the preferred choices.

System settings

Aside from the general configuration of the GitLab instance, the following needs to
be ensured on the GitLab server to minimize the chance of failures during the
import process:

Make sure that the github_import_extended_events feature flag is disabled
globally on your instance and, if it is not, disable it. The current state can
be checked in the Rails console:

irb(main):001:0> Feature.enabled?(:github_import_extended_events)
=> false


It is recommended to run the import on a weekend. The import takes a very
long time, probably 24-48 hours depending on your hardware, and it can fail if
something is deleted on GitHub while it is running: the importer then
encounters a dead link, which causes a failure. GitHub seems to remove mostly
pull requests and issues that are obvious spam without being prompted, so our
best bet is to run the import on weekends, when this is less likely to happen.
If your import fails and you see something like 404 - Not Found or a similar
error code in the import history, this is probably what happened and,
unfortunately, you will have to delete everything and start over. There is no
way to continue the import from where it stopped.


Increase the number of Sidekiq workers
for the importer. The default is just one Sidekiq worker, which slows down
the import significantly. Four workers are recommended; whether you can add
more depends on your hardware. The setting we currently use is two dedicated
workers and two general workers:


sidekiq['queue_groups'] = ['github_importer', 'github_importer_advance_stage', '*', '*']


Configure reduced GitHub API objects requested per page.


Ensure that the Maximum import size is over 2 GB (Admin panel -> Import and export settings).


If you want to periodically remove the imported project and then reimport
it, you will have to set the deletion period for projects to 0 so that they are
actually deleted immediately: Admin Area -> Settings -> General -> Visibility and access controls -> Deletion protection.
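
The feature flag and Sidekiq items above can also be applied from a shell. The following is a minimal sketch, assuming a Linux package (Omnibus) installation where the `gitlab-rails` and `gitlab-ctl` wrappers are available and `/etc/gitlab/gitlab.rb` is the configuration file. The function names and the `*_CMD` variables are hypothetical helpers introduced for this example only.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes an Omnibus install with `gitlab-rails` and `gitlab-ctl`.
# The *_CMD variables are hypothetical; they exist so the commands can be
# swapped out (e.g. for `echo`) to do a dry run.
GITLAB_RAILS_CMD="${GITLAB_RAILS_CMD:-sudo gitlab-rails}"
GITLAB_CTL_CMD="${GITLAB_CTL_CMD:-sudo gitlab-ctl}"

# Turn the feature flag off globally (harmless if it is already off).
disable_extended_events() {
  $GITLAB_RAILS_CMD runner 'Feature.disable(:github_import_extended_events)'
}

# Print the current state of the flag.
show_extended_events() {
  $GITLAB_RAILS_CMD runner 'puts Feature.enabled?(:github_import_extended_events)'
}

# Apply changes made to /etc/gitlab/gitlab.rb, such as sidekiq['queue_groups'].
apply_gitlab_rb() {
  $GITLAB_CTL_CMD reconfigure
}

# Usage:
#   disable_extended_events && show_extended_events
#   apply_gitlab_rb
```

Setting `GITLAB_RAILS_CMD=echo` prints the command instead of running it, which is a cheap way to check what would be executed before touching the server.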


Running the import

The import cannot be triggered via the UI because the UI only allows importing
repositories that you own, not arbitrary public repositories. You need to
use the REST API instead:
curl --request POST \
--url "https://gitlab.sighash.org/api/v4/import/github" \
--header "content-type: application/json" \
--header "PRIVATE-TOKEN: <gl-access-token>" \
--data '{
    "personal_access_token": "<gh-access-token>",
    "repo_id": "1181927",
    "target_namespace": "bitcoin",
    "optional_stages": {
      "single_endpoint_issue_events_import": true,
      "single_endpoint_notes_import": true,
      "attachments_import": false,
      "collaborators_import": false}}'

See also the documentation here.
In particular, the optional_stages need to be set exactly as shown above
to prevent the import from failing (see some additional documentation here;
"additional things to import" refers to the same options). Missing the collaborators
import is unfortunate; however, this GitLab functionality is built with companies
in mind that actually have full control over their contributors' accounts (see
also the Limitations section for further info on this). The gist attachments
feature doesn't seem to be used much, if at all, from what I can tell. If I am
mistaken here, please let me know so we can try to figure out a solution for
this.
Note that if you have imported the project before, you first need to delete it.
That can be done via the UI or the API:
curl --request DELETE --header "PRIVATE-TOKEN: <your_access_token>" \
     "https://gitlab.example.com/api/v4/projects/<your-project-ID>"

The ID of the project is returned by the import call, but
you can also use the project path instead, which should probably remain consistent.
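
When the project path is used instead of the numeric ID, the GitLab API expects the `/` in the path to be URL-encoded as `%2F`. A minimal sketch of a delete helper follows. `GITLAB_URL`, `GITLAB_TOKEN` and the function names are placeholders for this example, and `bitcoin/bitcoin` is an assumption about where the import ends up (the `bitcoin` namespace from the import call plus the repository's own name).

```shell
#!/usr/bin/env bash
# Sketch only. GITLAB_URL and GITLAB_TOKEN must be set by the caller;
# the function names are hypothetical.

# Build the API URL for a project given its numeric ID or its full path.
# A "/" in the path is encoded as "%2F"; numeric IDs pass through unchanged.
project_api_url() {
  local encoded
  encoded="$(printf '%s' "$1" | sed 's|/|%2F|g')"
  printf '%s/api/v4/projects/%s\n' "${GITLAB_URL}" "${encoded}"
}

# Delete the project; --fail makes curl exit non-zero on HTTP errors.
delete_project() {
  curl --silent --fail --request DELETE \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "$(project_api_url "$1")"
}

# Usage:
#   GITLAB_URL="https://gitlab.example.com"
#   GITLAB_TOKEN="<your_access_token>"
#   delete_project "bitcoin/bitcoin"   # assumed path, see above
```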
Note that during the import the project is not usable on GitLab at all! Users
cannot look at it, interact with it or run an export of the data.
While the import is running you can see it on the import history page of your instance,
but unfortunately there seems to be no way to get an indication of what the
progress is and how much longer the process will take.

Continuous import

The GitLab API also provides endpoints for checking an import status and exporting projects. So it is possible
to run a script that triggers the import from GitHub continuously and downloads
each new successful export before wiping the server and starting the import again.
This should allow for having a fresh export backup every 2 days if the import is
successful each time. But keep in mind that this also means that the project
on the GitLab server is effectively never usable, since it cannot be used while
an import is running.
A draft of a script for this can be found here,
but it is untested/WIP.
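
To illustrate the status and export endpoints, here is a rough sketch of the polling part of such a loop. It is not the draft script mentioned above. It assumes `GITLAB_URL`, `GITLAB_TOKEN` and `PROJECT_ID` are set by the caller, and that the API responses are single-line JSON containing `import_status` and `export_status` string fields. All function names are hypothetical, and the `sed`-based field extraction is a simplification; a real JSON parser such as `jq` would be more robust.

```shell
#!/usr/bin/env bash
# Sketch only. GITLAB_URL, GITLAB_TOKEN and PROJECT_ID are placeholders that
# must be set by the caller. No timeouts are implemented.

# project_api <suffix> [extra curl options...]: call /projects/:id<suffix>.
project_api() {
  local suffix="$1"; shift
  curl --silent --fail --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" "$@" \
    "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}${suffix}"
}

# Extract a top-level string field from JSON on stdin (simplified parsing).
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

fetch_import_json() { project_api /import; }

# Poll until the import has finished (returns 0) or failed (returns 1).
wait_for_import() {
  local interval="${1:-600}" status
  while true; do
    status="$(fetch_import_json | json_field import_status)"
    case "$status" in
      finished) return 0 ;;
      failed)   return 1 ;;
      *)        sleep "$interval" ;;  # scheduled, started or request error
    esac
  done
}

# Schedule an export, wait until it is ready, then download the archive.
export_project() {
  local out="$1" interval="${2:-60}"
  project_api /export --request POST >/dev/null || return 1
  until [ "$(project_api /export | json_field export_status)" = "finished" ]; do
    sleep "$interval"
  done
  project_api /export/download --output "$out"
}

# Usage:
#   wait_for_import && export_project "bitcoin-$(date +%F).tar.gz"
```

A full cycle would wrap this between the import call and the project deletion shown earlier. Since a failed import cannot be resumed, a non-zero return from `wait_for_import` should lead straight to deleting the project and starting over.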

Limitations of the data transferred

GitLab is able to create user accounts on the backup server based on the users
active on GitHub and then link all their activity correctly to those accounts. This
even allows the former GitHub user to join the GitLab server later, inherit
the account with its activity and continue working with it as before. However,
GitLab only does this if the GitHub user has made their email public. If
the email of the account on GitHub is private (as is the case for most Bitcoin
Core contributors) then this will not work. The user account will not be created
and the activity will not be linked to it. This can also be done retroactively.
Instead, the contributions will be assigned to the administrator account that
triggered the import, with a note at the top indicating which user
originally made the contribution (see example below). This means that for
users who don't have their email set to public, the switch would not be as
frictionless as for the users who do. However, this seems manageable.

Mirroring feature

One original idea of this experiment was to leverage the GitLab mirroring feature
to have real-time, or close to real-time, updates in a leader-follower setup. This
is currently not supported from GitHub.com to GitLab though, only the other way
around and between GitLab instances. It seems doable to build this, but we would
need to maintain both the code and the infrastructure for it.
However, if we were to switch the project to GitLab, we could use this to have
self-hosted follower instances that are up-to-date with the main site.

Brink backup server

Brink hosts a server that is reachable via https://gitlab.sighash.org which will
regularly back up the GitHub repo on GitLab. The latest Bitcoin
Core project backup can be seen here
(status 2024-02-26 at the time of this writing). Please provide feedback on
the quality of the data preserved, since not everything is working as
originally hoped (see limitations).
The server will run the backups as an import roughly once per week. This seems
to be the only workable solution for now, as will become clear in the sections
further below.

Exporting the backups

The server can allow users to join and create an export of these weekly backups
so that anyone can store these backups locally and launch an instance from them
when necessary. This would be a more lightweight way to participate in this
effort than running a server which does regular backups as well, though of
course the more people do this the better, so the following information may
help with hosting your own instance.