@sean-horn
Forked from jeremiahsnapp/upgrade.md
Last active August 29, 2015 13:57

All upgrades

All upgrades should be performed first with a copy of Production data, in a non-essential environment. This could be a Vagrant VM, which is our typical test environment for this purpose.

Backups are worth having at all times, for many reasons. During upgrades especially, tested backups become a critical part of our toolkit.

Upgrading from OPC1.2.x

Upgrading from OPC1.2.x to EC11x REQUIRES that you first upgrade all systems in the cluster to OPC1.4.6. Please check the runit state described in [1] (the runit checks; see Runit Process Structure and Checks) at each transition between versions. The highest-level upgrade process should look like this:

Check runsvdir status -> OPC-1.2.x -> Check runsvdir status -> OPC1.4.6 -> Check runsvdir status -> EC11x -> Check runsvdir status
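
The version path above can be expressed as a small lookup. This is only a sketch: the function name next_upgrade_hop is made up for this example, and it only knows the two hops this guide describes.

```shell
#!/bin/sh
# Hypothetical helper: print the next required hop for an installed version.
next_upgrade_hop() {
    case "$1" in
        1.2.*) echo "1.4.6" ;;    # OPC 1.2.x must go to OPC 1.4.6 first
        1.4.6) echo "EC11" ;;     # only OPC 1.4.6 may move on to EC11x
        *)     echo "unknown" ;;  # anything else is outside this guide
    esac
}
```

Remember that the runsvdir status should be checked before and after every hop, as the flow above shows.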

OPC 1.4.6 -> EC11.1.2 Upgrade process related bugs

  1. OC-11297 - EC 11.0.X not saving its migration-level state on HA backends. Breaks p-c-c upgrade on subsequent upgrades
  2. OC-11382 - HA Upgrades to 11.1.x fail because keepalived restart interferes with partybus migrations

NOTE: Unless otherwise noted, all patching should be done after OPC 1.4.6 is installed, and before the EC11.1.2 package install and upgrade begins.

Pre-Flight Check

  1. Backup the data on the bootstrap backend machine. (e.g. LVM snapshot, VMware snapshot, etc)

  2. Run the following on all machines to make sure things (e.g. runit) are in a sane state.

     private-chef-ctl reconfigure
    
  3. Stop all frontend machines.

     private-chef-ctl stop
    
  4. Identify the name of the original non-bootstrap backend machine. This is the backend machine that does not have :bootstrap => true in /etc/opscode/private-chef.rb.

  5. Stop keepalived on the original non-bootstrap backend machine. This will ensure that the bootstrap backend machine is the active machine. This action may trigger a failover.

     private-chef-ctl stop keepalived
    
  6. OC-11297 - On the backend machines, examine the /var/opt/opscode/upgrades/migration-level file. It should match the version on the frontend machines. In HA systems, the migration-level file is usually correct on the frontend nodes but not on the backend nodes, because the backend installation process gets interrupted for DRBD setup. If it is incorrect on the backend nodes, please copy it from the frontend nodes before proceeding.

     EC Version    migration-state
     1.4.6         major: 1, minor: 7
     11.0.x        major: 1, minor: 12
     11.1.x        major: 1, minor: 13

  7. Before proceeding, make sure that the bootstrap backend node and all of its services are healthy, and that all services are stopped on the standby. Please refer to [1] to make a determination about "healthy".
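
The migration-level check in step 6 could be scripted roughly as follows. This is a sketch under an assumption: it expects the migration-level file to record "major" and "minor", each followed by its number on the same line. The function names are invented for this example; the expected values come from the table in step 6.

```shell
#!/bin/sh
# Sketch only: assumes the file names "major" and "minor", each followed
# by its integer value on the same line.

# Print "MAJOR.MINOR" as read from the given migration-level file.
read_migration_level() {
    major=$(sed -n 's/.*major[^0-9]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    minor=$(sed -n 's/.*minor[^0-9]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
    echo "${major}.${minor}"
}

# Print the level the table in step 6 lists for a given server version.
expected_migration_level() {
    case "$1" in
        1.4.6)  echo "1.7" ;;
        11.0.*) echo "1.12" ;;
        11.1.*) echo "1.13" ;;
        *)      echo "unknown" ;;
    esac
}
```

On a backend you might compare the output of read_migration_level /var/opt/opscode/upgrades/migration-level with expected_migration_level for the installed version, and copy the file from a frontend if they differ.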

Upgrade Steps

  1. Install the Enterprise Chef server package on all machines using dpkg or rpm.

  2. OC-11382 - On both backend machines, copy the upgrade.rb from this gist to /opt/opscode/embedded/service/omnibus-ctl/upgrade.rb.

     cp /tmp/upgrade.rb /opt/opscode/embedded/service/omnibus-ctl/upgrade.rb
    
  3. On the bootstrap backend machine, perform a reconfigure and then WAIT about 2 minutes until all services have returned to a normal, working state according to ha-status and /var/log/opscode/keepalived/cluster.log

     private-chef-ctl reconfigure
    
  4. Once all services are verified, upgrade the bootstrap backend machine. (If anything strange happens here, please consider how the issue you see could be related to runit. Please refer to [1] for cleanup. You will also need to ensure that all omnibus-ctl, private-chef-ctl, and sv processes are gone. Then, be sure that the opscode-chef-mover service is started and retry the upgrade.)

     private-chef-ctl upgrade
    
  5. Copy the entire /etc/opscode directory from the bootstrap backend machine to all frontend and backend machines. For example, from each machine run:

     scp -r BOOTSTRAP_SERVER_IP:/etc/opscode /etc
    
  6. Upgrade the secondary backend machine.

     private-chef-ctl upgrade
    
  7. Upgrade all frontend machines.

     private-chef-ctl upgrade
    
  8. Run the following on all machines to make sure all services are started.

     private-chef-ctl start
    
  9. After the upgrade process is complete and the state of the system has been tested and verified, remove old data on all machines.

     private-chef-ctl cleanup
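
As a planning aid, the per-machine ordering of the steps above can be printed as a dry run. This is a sketch, not a tested automation: the role labels (bootstrap, secondary, frontend) and the function name are invented here, the package install is left out because the package file name depends on your platform, and cleanup is left out because it should only happen after the upgraded system has been verified. Nothing is executed; the commands are only echoed.

```shell
#!/bin/sh
# Dry-run sketch: print the commands from the steps above for one machine role.
upgrade_commands_for() {
    case "$1" in
        bootstrap)
            echo "cp /tmp/upgrade.rb /opt/opscode/embedded/service/omnibus-ctl/upgrade.rb"
            echo "private-chef-ctl reconfigure"   # then wait for ha-status to settle
            echo "private-chef-ctl upgrade"
            ;;
        secondary)
            echo "cp /tmp/upgrade.rb /opt/opscode/embedded/service/omnibus-ctl/upgrade.rb"
            echo "scp -r BOOTSTRAP_SERVER_IP:/etc/opscode /etc"
            echo "private-chef-ctl upgrade"
            ;;
        frontend)
            echo "scp -r BOOTSTRAP_SERVER_IP:/etc/opscode /etc"
            echo "private-chef-ctl upgrade"
            ;;
        *)
            echo "unknown role: $1" >&2
            return 1
            ;;
    esac
    echo "private-chef-ctl start"                 # final step on every machine
}
```

The bootstrap backend must finish its upgrade before the secondary backend and the frontends start theirs.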
    

Runit Process Structure and Checks

Please use the following diagram to understand the runit process supervision tree. All runit components can be inspected with ps aux | grep [s]v

    RHEL6/Ubuntu10.04+ Upstart (Upstart config file in /etc/init/opscode-runsvdir on pre EC11x, /etc/init/private-chef-runsvdir in EC11x)
    |
    opscode-runsvdir or private-chef-runsvdir 
    |
    ----> runsv -> (EC11 service like postgresql or opscode-erchef)
    ----> svlogd -> (Logging for each service's STDOUT. Goes into a "current" file)

Between upgrades from major version to major version of Private Chef or Enterprise Chef, you will want to check that the ps aux | grep [r]unsvdir output looks like this

root 1543 0.0 0.0 4032 196 ? Ss 20:18 0:00 runsvdir -P /opt/opscode/service log: ...........................................................................................................................................................................................................................................................................................................................................................................................................

and not like this

root 864 0.0 0.0 4088 476 ? Ss 2013 14:26 runsvdir -P /opt/opscode/service log: not exist?svlogd: pausing: unable to rename current: /var/log/opscode/opscode-erchef: file does not exist?svlogd: pausing: unable to rename current: /var/log/opscode/opscode-erchef: file does not exist?svlogd: pausing: unable to rename current: /var/log/opscode/opscode-erchef: file does not exist?svlogd: pausing: unable to rename current: /var/log/opscode/opscode-erchef: file does not exist?.
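
The difference between the two outputs above can be checked mechanically: a healthy line has nothing but dots after "log: ". The sketch below encodes that rule. The function name is invented for this example, and it assumes the ps line has the same shape as the two samples shown.

```shell
#!/bin/sh
# Sketch: succeed when the "log:" buffer of a runsvdir ps line holds only dots.
runsvdir_log_is_clean() {
    case "$1" in
        *"log: "*) ;;          # a runsvdir line with a log buffer
        *) return 1 ;;         # not a line this check understands
    esac
    buffer=${1#*log: }         # keep only the text after the first "log: "
    case "$buffer" in
        *[!.]*) return 1 ;;    # any non-dot character means logged errors
        *)      return 0 ;;
    esac
}
```

You could feed it each line of ps aux | grep [r]unsvdir and treat a non-zero return as a reason to stop and investigate before the next upgrade hop.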

Any number of issues can occur with runit's runsvdir process. The most common in an OPC/EC11 setting are these

  • In OPC 1.4.6, /var/log/opscode should have 755 permissions, but it doesn't
  • Any of the /var/log/opscode/SERVICE/current files are missing
  • In EC11, the ownership of /var/log/opscode is not opscode, so the processes cannot read/write their logfiles
  • The filesystem where the logs are stored is full

When you encounter a problem like this, check the error output in the process list as above and figure out what has gone wrong for the runsvdir process, its svlogd processes, or both. Correct the issue, shut down OPC/EC11, then use Upstart to restart runit's runsvdir:

  • private-chef-ctl stop
  • For OPC1.4.6 on RHEL6 and Ubuntu10.04+: initctl stop opscode-runsvdir
  • For EC11x on RHEL6 and Ubuntu10.04+: initctl stop private-chef-runsvdir
  • NOTE: During the upgrade of OPC 1.4.6 -> EC11.1.2, you may have both of the above.
  • If continuing an EC11.1.2 upgrade: initctl start private-chef-runsvdir
  • If fixing up an OPC1.4.6 system before an upgrade to EC11.1.2: initctl start opscode-runsvdir
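
The restart sequence can be printed for review before anyone runs it. This is a sketch: the product labels "opc" and "ec11" and both function names are invented for this example, and the job names are taken from the Upstart config paths in the diagram above (opscode-runsvdir before EC11x, private-chef-runsvdir in EC11x).

```shell
#!/bin/sh
# Sketch: map a product generation to its Upstart job name.
runsvdir_job_for() {
    case "$1" in
        opc)  echo "opscode-runsvdir" ;;
        ec11) echo "private-chef-runsvdir" ;;
        *)    echo "unknown product: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the stop/start sequence for one product.
print_runsvdir_restart() {
    job=$(runsvdir_job_for "$1") || return 1
    echo "private-chef-ctl stop"
    echo "initctl stop $job"
    echo "initctl start $job"
}
```

In the middle of an OPC 1.4.6 -> EC11.1.2 upgrade both jobs may exist, so you may need to stop both before starting the one you want.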

LDAP Authentication Bug

OC-11384 - EC 11.1.1+: Creating a new user with LDAP enabled fails

If you use LDAP authentication for the Enterprise Chef server then you will also want to use the following instructions on the frontend machines.

https://gist.github.com/irvingpop/9399446

#
# Copyright:: Copyright (c) 2012 Opscode, Inc.
#
# All Rights Reserved
#
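add_command "upgrade", "Upgrade your private chef installation.", 1 do
  # Reconfigure before running any migrations.
  reconfigure(false)
  # Run partybus from its own service directory, using the embedded bundler.
  Dir.chdir(File.join(base_path, "embedded", "service", "partybus"))
  bundle = File.join(base_path, "embedded", "bin", "bundle")
  # OC-11382 workaround: sleep for 2 minutes before starting the partybus migrations.
  status = run_command("echo 'Sleeping for 2 minutes before migration' ; sleep 120 ; #{bundle} exec ./bin/partybus upgrade")
  if status.success?
    puts "Chef Server Upgraded!"
    exit 0
  else
    # Exit non-zero so callers can tell that the migration failed.
    exit 1
  end
end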
@jeremiahsnapp

Thanks for such a great contribution. I merged your changes into my gist. I made a couple changes but let me know if you want anything reverted.
