vladshablinsky/formula_renames.md

## formula_renames.md

      
    Raw
  

              formula_renames.md
            
          
    Formula renames: allow any type of renames

I suggest you reading this and not going deeply into the details at first, because several approaches are described here and something can be wrong at some point and then another approach is introduced.
Metadata

Let's describe core formula renames at this point and then try to add same functionality to taps.
Let's have the following metadata for formulae:


We assign a commit to every installed package. The commit is the last commit applied to formula file (something like git rev-list -1 origin/master path/to/formula


Every formula has its own file with information about renames for this name. Deletions are also stored in that files.
Let's take A as an example formula. Every entry in this file for A renames has structure like this: commit_i newname_i. This shows that A was renamed to newname_i at commit_i, where commit_i is a rename_commit~1
Let's look at this file for Arenames closer:
# A-renames
94156b595b96008d47f7b5b37851fd178c9d00a7, A_1
7f979528dea723c1a1aae4ca31c1689df6df1703, A_2
f3cf43acfc764acba84fa20b4c42a3e0b5382589, A_3
b0cfecff16cf8bd04a544f19f69ed363126b3dae, ""  #<--- remove formula after commit
10c8b45678a446b759677294765a3505ba9b20f5, A_4
#           \                              /
#            \                            /
#             \               new name after that commit
#              \
#    commit before rename commit

What information can we get from this file? We know, that A was renamed to A_1, then A was renamed to A_2, which means, that either something was renamed to A or new A was created before next rename of A. Important note here is that entries are sorted in this file. It's the same if we say that we add new entries to the end of file each time. It can guarantee, that commit at line i always goes before commit at line j if i < j.
We suppose that at each moment of time or it's better to say at each commit, we have only one formula with particular name.


We have that structure, what's next? Let's suppose we have formula A installed at some commit 0000001, and this commit is an ancestor (not necessarily direct) of 94156b59, let's get new name of A package now. How do we do this? We go to A-renames file and try to find the first line, for which the commit will be a descendant of 0000001. We get the name of A -- it is A_1 at commit 94156b59. Next thing we do is going to A_1-renames file and doing the same procedure (find the first line, for which the commit will be a descendant of 94156b59). We continue this process untill we can't find a descedant or file with renames doesn't exist, which means, that that formula with that name has never been renamed before. So, we can resolve formula at each point of time.
Note: Binary search can be used here to find the first line, for which the commit will be a descendant of some_commit.
Pitfal: We probably need a way to control the size of metadata, but currently for tap migrations and formula renames there is no such way and I think it doesn't exist. The summary number of lines in all the renames files is the number of all the renames and deletions of formulae in repository.
Dependency resolution

                                         F ... F*
                                        /
       /-------> A ... A* -----------> D ... D*
      /                                 \
     /                                   G ... G*
    /
   X ----------> B ... B* -------> E ... E*
    \
     \ 
      \
       \-------> C ... C*

Let's try to resolve dependencies for X. A very important note here is that at the moment of the last commit changing X it had all the dependencies resolved. It means, that if X depends on A at commit_1 and if A changes name after commit_1 we need to find the first rename after commit_1 in A-renames file. Then we find the commit at which A changes name and it's A's next name, and continue searching in renames files A -> A_1 -> A_2 -> ... -> A* After that we go to the next dependency of X, which is B, resolve it and get B*, then to C, and then resolve dependencies of dependeicies. At the end we have our dependencies resolved. If at some point formula is removed, then we can't install package.
brew update and migration process

Let's difine migration mechanism that we have now at Homebrew as Current Migration Mechanism (CMM)
What can change on brew update if speaking about formulae?

multiple renames (A -> B -> C)
deletions (A -> "")
new formulae (A -> "", new A`)
loop renames (A -> B -> A)

At each moment of time (commit) we have only ONE formula with some name and there is only one way to resove every formula.
first case: suppose everything was clear before the update (no symlinks in the cellar, every installed package has didn't need to be migrated)

I    [#] A    \
N    [#] B     |
     [#] C      >  installed and present in report
     [#] D     |
R    [#] E    /
E    [ ] F
P    [ ] G
O    [ ] H
R    [ ] I
T    [ ] J

Migrations possible only for the formulae that are installed present in report. Not all of these migrations can be completed successfully.
What should we do to perform migrations:

Resolve every name and get last names for each package or find out that package is removed.
Find all the conflicting names e.g. A -> B, B -> A, we can't have them both migrated using CMM, because we can't have
A and B can't be both directories and symlinks. (At this point I don't solve this problem, I just show, that it can happen, let's say, that A and B are BROKEN, and later introduce FIX operation and show how we can handle this situation).
For all packages without conflicts we can use CMM.

second case: some of the packages have wrong names: i.e. they didn't migrate at some update before and need to be FIXed. Some symlinks can present in the cellar, which means that some migrations were made in the past.

We have some packages installed in the cellar and we have report. Let's look only on those pacakges, that are installed and present in report. There can be some broken packages that are not in report, but we don't handle this case at this point.
Let's suppose A failed to migrate at some point. A -> X is in report and A is installed, it can be not the same A as installed. Suppose A failed to migrate to B, so in A-renames we have at least some_commit, B. Let's try to resolve formula for out installed A packaged. For that purpose we have commit_installed_A associated with every installed package, so we try to find in A-renames first commit that is after commit_installed_A. Then we continue resolving the formula as we did it before (looking through renames files), at the end of the procedure we have some name A*. We do the same procedure for all the formula that are installed and present in report.
Some cases possible here:

A -> X -- A has nothing to do with installed A, installed A was renamed at some point and then new A appeared and it's new A that is renamed to X in the last update.
A -> ... -> A package probably need to be updated but it don't need to migrate it.

Basically we do almost the same as in the first step. We migrate only those packages that we can. We can migrate package that was migrated before if there is no conflicts. E.g. A -> B and we have A linked to B in the cellar. At the last update we have B -> C. After rename we'll have B linked to C and A linked to C.
Pitfall: What if A -> B migration failed, then we were able to install B, so we had A and B installed, and A moved to B, but we didn't perform migrations for A -- if we try to resolve formula for installed A and B, it will be the same package. A is broken, B is not.
FIX operation

Let's suppose we have a broken package. What does it mean?
The package is broken, if formula for installed package A is not A. (Package can be removed or renamed, but still present in the cellar).
If package is broken, we can't just migrate it using CMM. What if we remove this package and reinstall it with new name? Something that depends on it can be broken, so, we need to reinstall it as well (by the way, the software that depends on the package we're reinstalling can also be renamed). What we do here is reinstall all the packages in the dependency tree starting from leaves, for example, if we need to fix X:
                                  <--- F
                                 /
       <-------- A <----------- D 
      /                          \
     /                            <--- G
    /
   X <---------- B <------- E
    \
     \ 
      \
       \<------- C

We find every formula that depends on X. We can see, that A, B and C depend on X. A note here: something can dapend on X if something_commit < X_commit, where something_commit is commit of last change to the formula and X commit is a commit assigned to a package when it was installed. We do the same procedure for A, B, C. At the end we have leaves F, G, E, C. We can sort this tree topoligically and reinstall every package in that order. For our example it would be F, G, D, A, E, B, C, X. We reinstall every formula from this list moving from left to right and at the end of this process we have F, G, D, A, E, B, C, X fixed.
Note: If Y is not broken, then fix for this package doesn't change it and fixes only formula that depends on Y.
###Deletions
If something is deleted, we don't break the software that depends on it, but the package, that is deleted is broken and need to be FIXed.
###Pitfal:
We have X installed. It's broken, and was renamed to Y at some point. If X is broken it means, that for some reason we can't migrate it using CMM. What if we're able to install Y without fixing X? We have two same packages, but A haven't been migrated yet. We need to handle this case too.
Alternative approach

What if we have Cellar and Storage. Cellar has only symlinks to Storage and Storage store packages.
For example we have A and B installed.
Different packages have different names in Storage, even package that had name X at some point is stored not in the same directory as the package that has name X now.
Cellar               Storage
    A ---------------> A_abacaba
    
    B ---------------> B_0101001

If we rename A to B, after rename we have
Cellar               Storage
    A ------\          A_abacaba
             \
    B         \------> B_0101001

If there is no links to something in the Storage, it means that they can't appear in future, because if new A appears, in storage it will be with new suffix like A_131232.
Pros: easy migrations, easy deletions.
Cons: major change of the structure.
To think


How to test
Taps
Some other edge cases