mcouthon/New_upgrade_mechanism.md

## New_upgrade_mechanism.md

      
    New upgrade mechanism

#cloudify
Abstract

This document is a proposal for completely reworking the Cloudify Manager upgrade mechanism.
TL;DR - the current mechanism will be replaced by version-specific self-contained RPMs that will be installed in-place only.

Current problems


Massive downtime - teardown/bootstrap are necessary for an upgrade.
Fragmentation - the need to support many version combinations makes the upgrade code horrible to maintain (and look at).
Agent upgrade - upgrading an old manager to a new manager VM is very problematic, because the agent upgrade mechanism relies on the old manager staying alive through the process.
Relying on snapshots - a mechanism that was not designed with upgrade in mind, and really shows its age. It is cumbersome, inefficient, slow and unnecessary.

Proposed solution

Version specific

Instead of supporting every possible combination, only jumps from one logical version to the next will be supported. A reasonable starting point would be:

3.4.2 -> 4.0.1
4.0.1 -> 4.1.1
4.1.1 -> 4.2.0

This solves several problems:

Except for the 3.4.2 -> 4.0.1 upgrade, there is no need to consider the Elasticsearch DB at all.
Many more assumptions can be made for each particular upgrade (because we will know more about the version we’re upgrading).
Fragmentation in upgrades will be all but eliminated - no need to check 20 unrelated edge cases.

If users need to upgrade from 3.4.2 to 4.2.0, they will have to run every upgrade on the path in sequence (three upgrades with the list above: 3.4.2 -> 4.0.1 -> 4.1.1 -> 4.2.0).
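
The chaining of version-specific upgrades can be illustrated with a minimal Python sketch. The table mirrors the list above; the names `SUPPORTED_UPGRADES` and `upgrade_path` are hypothetical and not part of any existing Cloudify code.

```python
# Hypothetical table of supported jumps, copied from the proposed list.
SUPPORTED_UPGRADES = {
    "3.4.2": "4.0.1",
    "4.0.1": "4.1.1",
    "4.1.1": "4.2.0",
}


def upgrade_path(current, target):
    """Return the ordered list of (from, to) jumps from current to target."""
    path = []
    version = current
    while version != target:
        if version not in SUPPORTED_UPGRADES:
            raise ValueError(
                "no supported upgrade from {0} towards {1}".format(version, target))
        next_version = SUPPORTED_UPGRADES[version]
        path.append((version, next_version))
        # Guard against a misconfigured table that loops back on itself.
        if len(path) > len(SUPPORTED_UPGRADES):
            raise ValueError("upgrade table contains a cycle")
        version = next_version
    return path
```

For example, `upgrade_path("4.0.1", "4.2.0")` yields the jumps 4.0.1 -> 4.1.1 and 4.1.1 -> 4.2.0, in that order; a version with no entry in the table raises an error instead of guessing a route.
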

In-place only

Support for upgrades that are not in place was introduced because the upgrade mechanism currently relies on snapshots. Moving away from this paradigm makes it clear that the only reasonable way to upgrade a manager would be in-place.
This solves many problems:

The only things that will need to be upgraded are code, configurations and the DB schema. No need to touch the actual data in the DB at all. Snapshots can be repurposed to become mere DB backups (as they were intended to be).
No data migration will be necessary - only schema migrations, which are relatively easy.
Something close to 0 downtime can be achieved, because we’re only replacing code and config files.
Can be run locally, using the admin CLI. This should give us more control and reliability.
Will greatly simplify agent upgrade, as the IP will forever remain the same, and only one path for agent upgrade will be needed.

Implementation proposal

The upgrade will be a single self-contained RPM, which will hold:

The RPMs of the relevant parts to be upgraded (rest service, mgmtworker, CLI, etc).
Any necessary configuration files.
New DB migration files, if necessary.
A script that will deploy all of the necessary parts, restart any relevant services, and possibly trigger the agent upgrade.
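
As a rough illustration of what the deploy script could look like, here is a minimal Python sketch that turns the RPM contents into an ordered list of commands and runs them. It is a sketch under assumptions, not the proposed implementation: the function names, the idea of passing the migration step as a ready-made command, and any paths or service names supplied by the caller are all hypothetical.

```python
import subprocess


def build_plan(rpms, config_files, migration_cmd, services):
    """Return the ordered commands (argv lists) for one in-place upgrade.

    rpms          -- paths of the component RPMs shipped in the upgrade
    config_files  -- (source, destination) pairs of configuration files
    migration_cmd -- argv list for the schema migration, or None
                     (hypothetical; the real migration tool is not specified)
    services      -- names of the services to restart afterwards
    """
    plan = []
    for rpm in rpms:
        plan.append(["yum", "install", "-y", rpm])
    for src, dst in config_files:
        plan.append(["cp", src, dst])
    if migration_cmd:
        plan.append(list(migration_cmd))
    for service in services:
        plan.append(["systemctl", "restart", service])
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            # Stop at the first failing step rather than continue half-upgraded.
            subprocess.check_call(cmd)
```

Keeping the plan as plain data makes it easy to review or log before anything is changed on the manager, and the dry-run default means a real run has to be requested explicitly.
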

The upgrade mechanism (at least going forward) can be made to rely on the existing local bootstrap code, which already implements many of the necessary features. This should simplify some of the challenges.
The RPMs will be available for download on S3, to be used easily by support and customers.
The deployment will be done by a simple yum install command.

New process

A new process for creating the upgrade RPMs will need to be established in R&D to make sure that the RPMs are tested properly, and that all versions customers are running are indeed supported.
This will require rethinking how we test upgrades.
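
One check such a process might include is verifying that every version found at customers can reach the latest release through the supported jumps. The following Python sketch shows the idea; the function name and its inputs are hypothetical, and the list of customer versions would have to come from wherever that information is actually tracked.

```python
def unsupported_versions(customer_versions, upgrades, latest):
    """Return the customer versions that cannot reach `latest`.

    upgrades maps each supported source version to its target version.
    """
    missing = []
    for start in sorted(customer_versions):
        version, seen = start, set()
        # Follow the chain of jumps until it ends, loops, or hits `latest`.
        while version != latest and version in upgrades and version not in seen:
            seen.add(version)
            version = upgrades[version]
        if version != latest:
            missing.append(start)
    return missing
```

Run against the upgrade list proposed above, a customer on a version with no entry in the table (say, a hypothetical 4.0.0) would be reported as unsupported, which is a signal to either add an upgrade RPM or ask that customer to move to a supported starting version first.
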