@zmc
Created December 10, 2022 00:28

A few notes on the RHEVpocalypse

This document is very far from exhaustive; I just wanted to write a few things down before an extended PTO.

Things that were easier than expected

Rebuilding certain services in OpenShift without access to original deployments

  • This wouldn't have been feasible if David hadn't deployed it as a POC.
  • It will be even easier "next time" if we build a strong habit of keeping services and their configuration in version control.

Making the OpenShift cluster healthy (it wasn't very functional when the outage hit)

  • While there are a lot of rough edges in the OpenShift alerting UX, I was able to piece together solutions to many problems without much prior knowledge.
  • I had the only usable login, and that was partly by chance! Without it, I'm not sure we'd have been able to use the cluster.

Things that were more difficult than expected

Obtaining an SSL wildcard cert for the OCP cluster

  • letsencrypt's normal challenge mechanism of "write this string to a file on your webserver" (the HTTP-01 challenge) isn't accepted for wildcards; those require the DNS-01 challenge, so the client needs to be given enough control of your DNS to create a TXT record.
  • Configuring bind (the nameserver) to allow updates to the zone that letsencrypt expects to be able to modify is complex enough that I felt the need to implement it in the ceph-cm-ansible nameserver role.
  • Since the cert was to be deployed as the OpenShift cluster's new cluster-wide cert, the client side of the process couldn't just be "run the certbot script on a host via SSH"; I had to discover and deploy the cert-manager operator.
  • None of the above worked on the first try - or the tenth ;)
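
For reference, here is a minimal Python sketch of what the DNS-01 challenge asks for, following RFC 8555. It only computes the TXT record name and value and prints the equivalent nsupdate commands; it is not what cert-manager runs. The function names, the example domain, the token and thumbprint values, and the assumption that the zone accepts dynamic updates are all illustrative.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as ACME (RFC 8555) specifies."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def challenge_record_name(domain: str) -> str:
    """TXT record name the CA looks up; a wildcard is validated at its base domain."""
    base = domain[2:] if domain.startswith("*.") else domain
    return "_acme-challenge." + base.rstrip(".") + "."


def challenge_record_value(token: str, account_thumbprint: str) -> str:
    """TXT record value: base64url(SHA-256(token + '.' + account key thumbprint))."""
    key_authorization = f"{token}.{account_thumbprint}"
    return b64url(hashlib.sha256(key_authorization.encode("ascii")).digest())


def nsupdate_script(domain: str, token: str, account_thumbprint: str, ttl: int = 60) -> str:
    """Render nsupdate commands that would publish the challenge record.

    Hypothetical helper: assumes the zone allows dynamic updates from this client.
    """
    name = challenge_record_name(domain)
    value = challenge_record_value(token, account_thumbprint)
    return "\n".join([
        f"update delete {name} TXT",
        f'update add {name} {ttl} TXT "{value}"',
        "send",
    ])


if __name__ == "__main__":
    # Placeholder token and thumbprint; real ones come from the ACME server and account key.
    print(nsupdate_script("*.apps.example.com", "example-token", "example-thumbprint"))
```

The point of the sketch is that the wildcard `*.apps.example.com` is proven by a record at `_acme-challenge.apps.example.com`, which is why whatever requests the cert needs write access to that part of the zone.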

Routing

  • Surprisingly, OpenShift Routes, while extremely convenient, are only usable for HTTP and HTTPS (plus other TLS traffic that uses SNI). Any other protocol has to be exposed in a considerably more complex manner.
  • We still don't know how to properly open the cluster up to the public Internet, so we are using a lot more reverse proxies than we used to. This also involves rewriting request headers.
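
As a rough illustration of the kind of header rewriting a reverse proxy does, here is a small Python sketch. It is not our proxy configuration: the function name is made up, and the choice of the conventional `X-Forwarded-*` headers is an assumption about what a backend would want to see.

```python
# Hop-by-hop headers apply to a single connection and should not be forwarded.
HOP_BY_HOP = frozenset({
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
})


def rewrite_request_headers(incoming: dict, client_ip: str, scheme: str) -> dict:
    """Return the headers a proxy might send upstream (names lower-cased).

    Hypothetical helper: drops hop-by-hop headers and records the original
    client address, scheme, and host in X-Forwarded-* headers.
    """
    headers = {
        name.lower(): value
        for name, value in incoming.items()
        if name.lower() not in HOP_BY_HOP
    }
    # Append this hop's client to any chain left by an earlier proxy.
    prior = headers.get("x-forwarded-for")
    headers["x-forwarded-for"] = f"{prior}, {client_ip}" if prior else client_ip
    headers["x-forwarded-proto"] = scheme
    # Preserve the host the client asked for, unless an earlier proxy already did.
    if "host" in headers:
        headers.setdefault("x-forwarded-host", headers["host"])
    return headers


if __name__ == "__main__":
    print(rewrite_request_headers(
        {"Host": "app.example.com", "Connection": "keep-alive"},
        client_ip="203.0.113.7",
        scheme="https",
    ))
```

Each extra proxy in the chain has to do this consistently, or the service at the end sees the wrong client address or scheme.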

Coordination

  • We could have used more hands.

Monitoring; we are still missing many critical pieces
