@djmitche

djmitche/todo.md Secret

Last active August 29, 2015 14:14
GHOST TODO
  • reimage all pod hosts
    • pod 1
    • pod 2
    • pod 3
    • pod 4
    • pod 5
    • pod 6
    • pod 10
  • re-create AWS builder images
  • re-create AWS masters - dustin
  • reimage onsite masters - 1st dustin, rest amy
  • reimage signing servers - dustin
  • re-create rpmpackager - dustin
  • reimage aws-manager1 - jake
@djmitche
Author

djmitche commented Feb 4, 2015

stopping a master:

python buildfarm/maintenance/manage_masters.py -f buildfarm/maintenance/production-masters.json -H bm103-tests1-linux graceful_stop

(it can take a while!!)
cf. https://wiki.mozilla.org/ReleaseEngineering/How_To/Manage_Buildbot_with_Fabric

@djmitche
Author

djmitche commented Feb 4, 2015

AWS master process (bug 1130176):

  • pick a master, and clear it with sheriffs. Only one master of each type (build, test, try) across scl3 and AWS should be down at once.
  • disable it in slavealloc by clicking the edit button and unchecking the "enabled" checkbox
  • downtime it in nagios for a very long time (you'll be removing it from nagios before the downtime expires)
  • run the graceful_stop command and wait for it to finish (see below)
  • while waiting:
    • update the build.mozilla.org CNAME to point to .bb.
    • update the link in slavealloc to point to .bb.
    • disable the master in production-masters.json by setting "enabled": false, and change the hostname and db_name (see diff below)
  • when the graceful stop is complete:
    • halt the old master
    • reboot the new master and wait for it to come up and start buildbot
  • re-enable it in slavealloc (check the "enabled" checkbox)
  • re-enable it in production-masters.json (see below)
  • update nagios with the new hostname (see below)

Onsite master process (bug 1126428):

  • pick a master, and clear it with sheriffs. Only one master of each type (build, test, try) across scl3 and AWS should be down at once.
  • disable it in slavealloc by clicking the edit button and unchecking the "enabled" checkbox
  • delete the host from nagios and commit
  • run the graceful_stop command and wait for it to finish (see above)
  • while waiting:
    • kill the puppet crontask on the master (rm /etc/cron.d/puppetcheck and kill any running puppet)
    • update inventory to use the .bb. name and a new IP
      • set the new hostname on the system
      • set the new hostname and IP on the SREG
      • set the new DHCP scope on the hw
    • fix the build.mozilla.org CNAME
    • update the link in slavealloc to include .bb.
    • disable in production-masters.json by setting "enabled": false, and change the hostname and db_name (see diff below)
  • when the stop is complete:
    • change the network interface VLAN to vlan268 and hostname to .bb. in VMware
    • set the host to enter BIOS on next boot (settings -> options -> boot options)
    • reboot
    • ensure network boot is the first option
    • puppetize
    • log in, then copy and paste the host's new ssh host key into modules/ssh/templates/known_hosts.erb in puppet and commit. Don't bother to remove the old host key - we'll come back.
    • reboot & wait for it to come up and start buildbot
  • re-enable it in slavealloc (check the "enabled" checkbox)
  • re-enable it in production-masters.json (see below)
  • update nagios with the new hostname (see below)

lather, rinse, repeat.

Details

Graceful Shutdown:

You'll need a virtualenv with fabric and the tools repository installed, and you'll need to be somewhere on the network from which you can SSH to the master. $nick is given in slavealloc. Or you can just ask buildduty.

python buildfarm/maintenance/manage_masters.py -f buildfarm/maintenance/production-masters.json -H $nick graceful_stop
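When stopping several masters in a row, it can help to build the command from the nick instead of retyping it. The following is a minimal sketch that only assembles and prints the command shown above; the helper name `graceful_stop_command` is hypothetical, and it assumes you run it from the root of the tools checkout.

```python
import shlex

# Paths copied from the command above, relative to the tools repository root.
TOOLS_SCRIPT = "buildfarm/maintenance/manage_masters.py"
MASTERS_JSON = "buildfarm/maintenance/production-masters.json"


def graceful_stop_command(nick):
    """Return the argv list that gracefully stops the master named `nick`."""
    return ["python", TOOLS_SCRIPT, "-f", MASTERS_JSON, "-H", nick, "graceful_stop"]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Printing only; nothing is executed here.
    print(as_shell(graceful_stop_command("bm103-tests1-linux")))
```

To actually run it, the same list can be handed to `subprocess.check_call`, keeping in mind that the stop can take a while.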

Commits

This file is in http://hg.mozilla.org/build/tools at buildfarm/maintenance/production-masters.json. We have blanket permission for these changes, but check the diff before committing, as hg will happily scoop up any other changes you might have made in the repo.

The first production-masters change should look like this:

diff --git a/buildfarm/maintenance/production-masters.json b/buildfarm/maintenance/production-masters.json
--- a/buildfarm/maintenance/production-masters.json
+++ b/buildfarm/maintenance/production-masters.json
@@ -1442,10 +1442,10 @@
     "buildbot_setup": "/builds/buildbot/tests1-linux/buildbot/master/setup.py",
     "buildbot_version": "0.8.2",
     "datacentre": "scl3",
-    "db_name": "buildbot-master103.srv.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master",
-    "enabled": true,
+    "db_name": "buildbot-master103.bb.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master",
+    "enabled": false,
     "environment": "production",
-    "hostname": "buildbot-master103.srv.releng.scl3.mozilla.com",
+    "hostname": "buildbot-master103.bb.releng.scl3.mozilla.com",
     "http_port": 8201,
     "limit_fx_platforms": [
       "linux",

and can be committed with message Bug $bugid: disable and rename $shortname; r=bhearsum,rail.
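The edit in the diff above is mechanical, so here is a rough sketch of it as a function on a single master entry. It assumes each entry is a JSON object with "hostname", "db_name" and "enabled" keys, as the diff suggests; the function name `disable_and_rename` is made up for illustration.

```python
import json


def disable_and_rename(master):
    """Return a copy of one master entry, disabled and moved from .srv. to .bb."""
    updated = dict(master)
    updated["enabled"] = False
    for key in ("hostname", "db_name"):
        # Only the first ".srv." (the one in the hostname part) is replaced.
        updated[key] = master[key].replace(".srv.", ".bb.", 1)
    return updated


# Entry trimmed to the three fields that the diff touches.
entry = {
    "db_name": "buildbot-master103.srv.releng.scl3.mozilla.com:/builds/buildbot/tests1-linux/master",
    "enabled": True,
    "hostname": "buildbot-master103.srv.releng.scl3.mozilla.com",
}

if __name__ == "__main__":
    print(json.dumps(disable_and_rename(entry), indent=2, sort_keys=True))
```

Rewriting the whole file with `json.dump` may reformat unrelated lines, so editing by hand and using something like this only to double-check the expected values is probably the safer route. Either way, check the diff before committing.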

The addition of an SSH key is just adding a single line (sorted correctly) and can be committed with Bug $bugid: add SSH host key for $shortname; r=dustin.

The second production-masters change just reverts "enabled" to true, and can be committed with Bug $bugid: re-enable $shortname; r=bhearsum,rail.

The nagios change just substitutes .bb. for .srv. in the hostname.

@djmitche
Author

djmitche commented Feb 5, 2015

bm103 (scl3) was reimaged yesterday and puppetized just fine, but I forgot to enable it so it's not quite time to do the remaining buildmasters yet.

@djmitche
Author

djmitche commented Feb 5, 2015

Woo, one green build. Ship it!

@djmitche
Author

djmitche commented Feb 5, 2015

We can do multiple masters in parallel as long as they're different types (different pools in slavealloc).
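One way to apply that rule when planning a batch is to take at most one master from each pool. This is only a sketch: the master names and pool labels below are placeholders, not real slavealloc data, and `pick_parallel_batch` is a hypothetical helper.

```python
def pick_parallel_batch(masters):
    """Pick at most one master per pool, keeping the input order.

    `masters` is an iterable of (name, pool) pairs.
    """
    batch = []
    seen_pools = set()
    for name, pool in masters:
        if pool not in seen_pools:
            seen_pools.add(pool)
            batch.append(name)
    return batch


# Placeholder names and pools, for illustration only.
remaining = [
    ("master-a", "build"),
    ("master-b", "build"),
    ("master-c", "tests"),
    ("master-d", "try"),
    ("master-e", "tests"),
]

if __name__ == "__main__":
    print(pick_parallel_batch(remaining))
```

Masters skipped in one batch (here the second "build" and second "tests" entries) simply wait for the next round.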

@djmitche
Author

djmitche commented Feb 6, 2015

Need to update the process to include moving hosts to the bb VLAN, and re-do buildbot-master{82,103}.

@amyrrich

amyrrich commented Feb 8, 2015

@djmitche
Author

djmitche commented Feb 9, 2015

I'm updating all AWS srv masters at once (so all but the last two)

  • add puppet nodes for them
  • add inventory SYS entries for them
  • add SREG entries for them
  • update the Name tag in EC2 for the old instances, to avoid conflict
  • create the new instances (but don't reboot, which would start buildbot)
  • gather all new SSH host keys and add them in a single puppet commit
  • proceed as above
  • disable termination protection and terminate the old instances
  • remove SYS and SREG entries for the old instances
  • remove old instances from puppet
  • remove old SSH keys from puppet
  • remove old IPs from the signing ACL
  • update flows
  • update security groups
  • update fwunit

@djmitche
Author

djmitche commented Feb 9, 2015

invtool SYS create --operating-system-pk 82 --server-model-pk 773 --allocation-pk 2 --system-rack-pk 286 --system-type-pk 9 --system-status-pk 1 --hostname $host
invtool SREG create --fqdn $host --system-hostname $host --private --no-public --ttl 60 --ip $ip --comment 'Bug 1130176'
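Since the two invtool commands above differ only in $host and $ip, they can be generated per host. The sketch below just builds and prints the command lines, using exactly the flags shown above; the function name, the example hostname and the example IP are hypothetical.

```python
import shlex


def invtool_commands(host, ip, bug="1130176"):
    """Return the SYS and SREG create commands for one host as argv lists."""
    sys_cmd = [
        "invtool", "SYS", "create",
        "--operating-system-pk", "82",
        "--server-model-pk", "773",
        "--allocation-pk", "2",
        "--system-rack-pk", "286",
        "--system-type-pk", "9",
        "--system-status-pk", "1",
        "--hostname", host,
    ]
    sreg_cmd = [
        "invtool", "SREG", "create",
        "--fqdn", host,
        "--system-hostname", host,
        "--private", "--no-public",
        "--ttl", "60",
        "--ip", ip,
        "--comment", "Bug " + bug,
    ]
    return sys_cmd, sreg_cmd


def as_shell(argv):
    """Render an argv list as a shell command, quoting where needed."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical host and IP, for illustration only.
    for cmd in invtool_commands("new-master.bb.example.com", "10.0.0.10"):
        print(as_shell(cmd))
```

Printing the commands first makes it easy to review the whole batch before anything touches inventory.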

@nthomas-mozilla

If the master IP address changes then update $config::signing_new_token_allowed_ips to flow through:
http://hg.mozilla.org/build/puppet/file/3c6c47638686/modules/signingserver/manifests/instance.pp#l70
http://hg.mozilla.org/build/puppet/file/3c6c47638686/modules/signingserver/templates/signing.ini.erb#l42

Otherwise builds will hit errors like:

======== Started download_token exception (results: 4, elapsed: 48 mins, 1 secs) (at 2015-02-10 13:39:25.678826) =========
Slave: bld-linux64-spot-291
IP: 10.134.53.126
Duration: 25200
URI: https://mac-signing3.srv.releng.scl3.mozilla.com:9110/token
<buildbotcustom.steps.signing.SigningServerAuthenication instance at 0xfcccf80>: token generation failed, error message: 403 Forbidden
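A quick pre-flight check of a master's new IP against the allowed list might avoid that 403. The sketch below assumes the allowed list is a set of plain IPs or CIDR ranges, which may not match the real format of $config::signing_new_token_allowed_ips; the addresses used are made up.

```python
import ipaddress


def is_allowed(ip, allowed):
    """Return True if `ip` matches any entry in `allowed`.

    Entries may be single addresses ("10.0.0.5") or CIDR ranges ("10.0.1.0/24").
    """
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False) for entry in allowed
    )


# Made-up addresses, for illustration only.
allowed_ips = ["10.0.0.5", "10.0.1.0/24"]

if __name__ == "__main__":
    for new_master_ip in ("10.0.1.17", "10.0.2.17"):
        status = "ok" if is_allowed(new_master_ip, allowed_ips) else "NOT in allowed list"
        print(new_master_ip, status)
```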

@Callek

Callek commented Feb 11, 2015

Also need to update the CNAME for *.build.mozilla.org to point to *.bb. instead of *.srv.

@djmitche
Author

Snapshot of the processes on bm66.srv:

root      1158  0.0  0.0  20440  1272 ?        Ss    2013  16:28 crond
root     10169  0.0  0.0  39176  1672 ?        S    11:55   0:00  \_ CROND
cltbld   10171  0.0  0.0   9288  1200 ?        Ss   11:55   0:00  |   \_ /bin/bash /usr/local/bin/run_b2g_bumper.sh
cltbld   10205  0.0  0.0   9288   760 ?        S    11:55   0:00  |       \_ /bin/bash /usr/local/bin/run_b2g_bumper.sh
cltbld   12997  8.4  0.2 1327288 16404 ?       Sl   12:10   0:02  |           \_ python /builds/b2g_bumper/mozharness/scripts/b2g_bumper.py --base-work-dir /builds/b2g_bumper/master -c /builds/b2g_bumper/mozharness/configs/b2g_bumper/master.py --import-git-ref-cache --push-loop --export-git-ref-cache
cltbld   13469  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/prebuilts/clang/linux-x86/3.1 refs/tags/android-4.3_r2.1
cltbld   13471  2.0  0.0  89912  6672 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/prebuilts/clang/linux-x86/3.1 https://git.mozilla.org/external/caf/platform/prebuilts/clang/linux-x86/3.1
cltbld   13472  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/host/i686-linux-glibc2.7-4.6 refs/tags/android-4.3_r2.1
cltbld   13475  2.3  0.1  90976  7688 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/host/i686-linux-glibc2.7-4.6 https://git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/host/i686-linux-glibc2.7-4.6
cltbld   13473  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/host/x86_64-linux-glibc2.7-4.6 refs/tags/android-4.3_r2.1
cltbld   13476  2.0  0.0  89992  6672 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/host/x86_64-linux-glibc2.7-4.6 https://git.mozilla.org/external/caf/platform/prebuilts/gcc/linux-x86/host/x86_64-linux-glibc2.7-4.6
cltbld   13482  0.3  0.0  16440  1772 ?        R    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/device/common refs/tags/android-4.3_r2.1
cltbld   13485  7.3  0.1 103568 10204 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/device/common https://git.mozilla.org/external/caf/device/common
cltbld   13484  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/device/sample refs/tags/android-4.3_r2.1
cltbld   13492  2.0  0.1  90784  7700 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/device/sample https://git.mozilla.org/external/caf/device/sample
cltbld   13487  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/bionic refs/tags/android-4.3_r2.1
cltbld   13493  2.0  0.1  91792  8080 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/bionic https://git.mozilla.org/external/caf/platform/bionic
cltbld   13488  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/bootable/recovery refs/tags/android-4.3_r2.1
cltbld   13494  5.6  0.1 103364 10024 ?        Rl   12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/bootable/recovery https://git.mozilla.org/external/caf/platform/bootable/recovery
cltbld   13491  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/bluetooth/bluedroid refs/tags/android-4.3_r2.1
cltbld   13498  2.6  0.1 101000  7700 ?        Rl   12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/bluetooth/bluedroid https://git.mozilla.org/external/caf/platform/external/bluetooth/bluedroid
cltbld   13495  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/bison refs/tags/android-4.3_r2.1
cltbld   13499  1.6  0.0  89392  6368 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/bison https://git.mozilla.org/external/caf/platform/external/bison
cltbld   13497  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/bsdiff refs/tags/android-4.3_r2.1
cltbld   13503  4.0  0.1 102176  8548 ?        Rl   12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/bsdiff https://git.mozilla.org/external/caf/platform/external/bsdiff
cltbld   13500  1.6  0.0  18284  3660 ?        R    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/bzip2 refs/tags/android-4.3_r2.1
cltbld   13501  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/checkpolicy refs/tags/android-4.3_r2.1
cltbld   13506  3.3  0.1 101996  8280 ?        Rl   12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/checkpolicy https://git.mozilla.org/external/caf/platform/external/checkpolicy
cltbld   13502  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/dhcpcd refs/tags/android-4.3_r2.1
cltbld   13508  5.3  0.1 103496  9800 ?        Rl   12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/dhcpcd https://git.mozilla.org/external/caf/platform/external/dhcpcd
cltbld   13505  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/dnsmasq refs/tags/android-4.3_r2.1
cltbld   13507  6.3  0.1 103432 10000 ?        Rl   12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/dnsmasq https://git.mozilla.org/external/caf/platform/external/dnsmasq
cltbld   13512  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/dropbear refs/tags/android-4.3_r2.1
cltbld   13513  3.0  0.0  89104  6072 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/dropbear https://git.mozilla.org/external/caf/platform/external/dropbear
cltbld   13515  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/e2fsprogs refs/tags/android-4.3_r2.1
cltbld   13516  3.0  0.0  89104  6080 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/e2fsprogs https://git.mozilla.org/external/caf/platform/external/e2fsprogs
cltbld   13517  0.0  0.0  15780  1040 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/elfutils refs/tags/android-4.3_r2.1
cltbld   13518  3.0  0.0  89404  6240 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/elfutils https://git.mozilla.org/external/caf/platform/external/elfutils
cltbld   13520  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/expat refs/tags/android-4.3_r2.1
cltbld   13521  4.0  0.0  89104  6096 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/expat https://git.mozilla.org/external/caf/platform/external/expat
cltbld   13527  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/fdlibm refs/tags/android-4.3_r2.1
cltbld   13528  4.0  0.0  88968  6040 ?        S    12:10   0:00  |               |   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/fdlibm https://git.mozilla.org/external/caf/platform/external/fdlibm
cltbld   13531  0.0  0.0  15780  1044 ?        S    12:10   0:00  |               \_ git ls-remote https://git.mozilla.org/external/caf/platform/external/flac refs/tags/android-4.3_r2.1
cltbld   13532  2.0  0.0  85788  3788 ?        R    12:10   0:00  |                   \_ git-remote-https https://git.mozilla.org/external/caf/platform/external/flac https://git.mozilla.org/external/caf/platform/external/flac
