zmc/rhevpocalypse.md

## rhevpocalypse.md

      
A few notes on the RHEVpocalypse

This document is very far from exhaustive; I just wanted to write a few things down before an extended PTO.

Things that were easier than expected

Rebuilding certain services in OpenShift without access to original deployments


This wouldn't have been feasible if David hadn't already deployed the OpenShift cluster as a POC.
It will be even easier "next time" if we build a strong habit of keeping services and their configuration in version control.

Making the OpenShift cluster healthy (it wasn't very functional when the outage hit)


While there are a lot of rough edges in the OpenShift alerting UX, I was able to piece together solutions to many problems without much prior knowledge.
I had the only usable login - and that was partly by chance! Without that, I'm not sure we'd have been able to use the cluster at all.

Things that were more difficult than expected

Obtaining an SSL wildcard cert for the OCP cluster


Let's Encrypt's usual HTTP-01 challenge ("write this string to a file on your webserver") isn't available for wildcard certs; they require the DNS-01 challenge, which means giving the ACME client enough control of your DNS to create a TXT record.
Configuring BIND (the nameserver) to accept dynamic updates to the zone that Let's Encrypt needs to modify is complex enough that I felt the need to implement it in the ceph-cm-ansible nameserver role.
Since the cert was to be deployed as the OpenShift cluster's new cluster-wide cert, the client side of the process couldn't just be "run the certbot script on a host via SSH"; I had to discover and deploy the cert-manager operator.
None of the above worked on the first try - or the tenth ;)
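
For anyone debugging this later, it helps to know exactly what record the DNS challenge expects to find. The sketch below follows the DNS-01 rules in RFC 8555 (section 8.4): the TXT record lives at `_acme-challenge.<base domain>` and holds the unpadded base64url SHA-256 digest of `token.thumbprint`. The function names, the domain, the token and the thumbprint are all made up for illustration; this is not the code path cert-manager or certbot actually runs.

```python
import base64
import hashlib


def challenge_record_name(domain: str) -> str:
    """Return the DNS name where the DNS-01 TXT record must be published."""
    # A wildcard such as *.apps.example.com is validated at its base name.
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base.rstrip('.')}"


def challenge_record_value(token: str, account_thumbprint: str) -> str:
    """Return the TXT record value for a DNS-01 challenge."""
    # The key authorization is the challenge token joined to the
    # account key thumbprint with a dot.
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url encoding with the trailing '=' padding removed.
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    # Hypothetical inputs; a real token comes from the ACME server and a
    # real thumbprint is derived from the ACME account key.
    name = challenge_record_name("*.apps.example.com")
    value = challenge_record_value("hypothetical-token", "hypothetical-thumbprint")
    print(f'{name}. 60 IN TXT "{value}"')
```

Comparing this against what `dig TXT _acme-challenge.<domain>` returns is a quick way to tell whether a failure is on the nameserver side (the update never landed) or the client side.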

Routing


Surprisingly, OpenShift Routes, while extremely convenient, are only usable for HTTP. Any other protocol has to be implemented in a considerably more complex manner.
We still don't know how to properly open the cluster up to the public Internet, so we are using a lot more reverse proxies than we used to. This also involves rewriting request headers.
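
As a rough illustration of the kind of header rewriting a reverse proxy in front of the cluster ends up doing, here is a minimal sketch. It assumes the common `X-Forwarded-*` conventions; the function name, hostnames and addresses are hypothetical, and the headers our proxies actually rewrite may differ.

```python
from typing import Dict


def rewrite_request_headers(
    headers: Dict[str, str],
    client_ip: str,
    upstream_host: str,
    scheme: str = "https",
) -> Dict[str, str]:
    """Return a copy of the request headers adjusted for an upstream service."""
    # HTTP header names are case-insensitive, so normalise them for lookup.
    out = {name.lower(): value for name, value in headers.items()}

    # Append this hop's client address to any existing forwarding chain.
    prior = out.get("x-forwarded-for")
    out["x-forwarded-for"] = f"{prior}, {client_ip}" if prior else client_ip
    out["x-forwarded-proto"] = scheme

    # Preserve the hostname the client asked for, unless an earlier
    # proxy already recorded it.
    original_host = out.get("host")
    if original_host:
        out.setdefault("x-forwarded-host", original_host)

    # The upstream (for example an OpenShift Route) matches on Host.
    out["host"] = upstream_host
    return out


if __name__ == "__main__":
    incoming = {"Host": "service.example.com", "Accept": "*/*"}
    print(rewrite_request_headers(incoming, "192.0.2.7", "service.apps.example.internal"))
```

The main point is that the upstream sees the Route's hostname in `Host` while the application can still recover the public hostname and client address from the forwarded headers.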

Coordination


We could have used more hands.

Monitoring; we are still missing many critical pieces