These are our Cassandra upgrade checklists, almost exactly as performed. They include notes about unexpected things that occurred on the first node.
- stop Chef from running via cron
- switch the node's run list to the upgrade role:
knife node run_list set ***********.opsmatic.com steph-role
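A sketch of the cron half plus a check that the run-list change took; the cron entry location is an assumption, adjust to wherever your setup installs it:
sudo rm /etc/cron.d/chef-client               # assumed location of the chef-client cron entry
knife node show ***********.opsmatic.com -r   # confirm the node is now on steph-role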
- nodetool upgradesstables - this appears to be required before AND after upgrade?
nodetool drain
(gracefully stop serving traffic)
sudo service cassandra stop
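The same sequence with a verification step in the middle; nodetool netstats reports the node's mode, which should read DRAINED before the service is stopped:
nodetool drain                   # flush memtables and stop accepting writes
nodetool netstats | grep Mode    # expect: Mode: DRAINED
sudo service cassandra stop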
- remove prod-cassandra security group and verify the node cannot talk to the rest of the machines on the service ports
aws ec2 modify-instance-attribute --instance-id i-******** --groups sg-********
- on host:
telnet cass02.usw2.opsmatic.com 7000
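telnet covers the gossip port; a quick loop over the other service ports makes the check more thorough (the list assumes stock Cassandra ports: 7000/7001 gossip, 9042 native protocol, 9160 Thrift):
for port in 7000 7001 9042 9160; do
  nc -z -w 2 cass02.usw2.opsmatic.com $port && echo "port $port still reachable!"
done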
- actually apply the new Chef cookbooks:
sudo chef-client --once
- Note: small snag: the first Chef run fails because somehow Cassandra ends up getting started with the old cassandra.yaml file; simply
sudo rm /etc/cassandra/cassandra.yaml
and run Chef again
- Note: small snag: cluster_ips, which is used to populate the seed nodes, returned no results because no hosts were yet in steph-role. Using the cluster_ips attribute override in the environment to get around this (a sketch of one way to set it follows the next note). That's probably for the best for the early part of the process, since I can manually set it to just the cassandra-role nodes, which will allow the 2.x node to gossip with the original cluster.
- Note: big snag: we were using a version of the cassandra cookbook that didn't support multiple data directories. Had to upgrade to version 3.4.0 of the cookbook (from 2.9.0); the cookbook had since been renamed, etc. It was a bit of open-heart Chef surgery, but we're back at it.
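To make the "attribute override in the environment" bit concrete, a sketch via knife; the environment name and the exact attribute path are illustrative and depend on the cookbook version:
knife environment edit production
# then add something along these lines to the environment JSON:
#   "override_attributes": {
#     "cassandra": { "cluster_ips": ["10.x.x.1", "10.x.x.2", "10.x.x.3"] }
#   }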
- visual spot check of configuration (see the grep sketch after this list):
- initial_token should not be set in /etc/cassandra/cassandra.yaml
- data directories should be pointing to /data1/keyspaces and /data2/keyspaces
- Heap size should be 6GB
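A grep version of the same spot check; cassandra-env.sh as the home of the heap setting assumes the standard package layout:
grep -E '^initial_token' /etc/cassandra/cassandra.yaml          # should print nothing
grep -A 3 data_file_directories /etc/cassandra/cassandra.yaml   # expect /data1/keyspaces and /data2/keyspaces
grep MAX_HEAP_SIZE /etc/cassandra/cassandra-env.sh              # expect 6G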
- start the service back up (probably already done by Chef) -
sudo service cassandra start
- the service should start but complain about not being able to talk to the rest of the cluster. It SHOULD show itself in nodetool status as having a bunch of data, though it may never get far enough for nodetool to work, since it can't gossip
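A couple of ways to keep an eye on it while it sits isolated; the log path assumes the stock package layout:
tail -f /var/log/cassandra/system.log   # gossip failures are expected while the node is cut off
nodetool status                         # may fail or hang until gossip is restored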
- restore prod-cassandra security group -
aws ec2 modify-instance-attribute --instance-id i-******** --groups sg-******** sg-********
- start service back up if it had previously died due to being unable to gossip
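A sketch of confirming the node is back in the cluster once the security group is restored:
nc -z -w 2 cass02.usw2.opsmatic.com 7000 && echo "gossip port reachable again"
sudo service cassandra start   # only needed if the service died while isolated
nodetool status                # this node should show as UN (Up/Normal) with its full load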
nodetool upgradesstables
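upgradesstables rewrites every sstable in the new on-disk format and can run for hours on a node with this much data, so running it under screen (tmux or nohup would do equally well; a habit, not a requirement) protects it from a dropped SSH session:
screen -S upgradesstables
nodetool upgradesstables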
- Note: default compaction throughput got reset to 16 MB/sec in steph-role; setting it manually to 1024 so that upgradesstables finishes more quickly (commands sketched below). We'll set the Chef default to whatever cassandra-role was running with when this operation is finished.
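The throughput bump from the note above as nodetool commands; setcompactionthroughput takes a value in MB/sec, and compactionstats shows how much compaction work is still queued:
nodetool setcompactionthroughput 1024   # temporary bump so upgradesstables finishes sooner
nodetool compactionstats                # watch the pending upgrade tasks drain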