@KenG98 KenG98/pm.txt
Last active Apr 10, 2018

April 10th, 10:30am massive bostonhacks.io outage -- post mortem
# DISCOVERY
In the early morning hours of Tuesday, April 10th, 2018, BostonHacks member Charles Ma discovered that the BostonHacks landing page, bostonhacks.io, was offline. Charles promptly used Facebook Messenger to notify the BostonHacks team (from now on referred to as the BH team). Ken Garber then opened his computer to diagnose and repair the issue.
# DIAGNOSIS
Accessing bostonhacks.io, Ken got an error message from Cloudflare stating that "The SSL certificate presented by the server did not pass validation...". He realized that the SSL certificates from Let's Encrypt are short-lived and had probably just expired.
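For next time, the "probably just expired" guess can be confirmed before touching anything. This is a minimal sketch, not something that was run during the outage. It assumes openssl is installed and that the certificate sits in certbot's usual location (/etc/letsencrypt/live/bostonhacks.io/cert.pem), which this post mortem does not confirm. The function name cert_status is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether a PEM certificate is currently valid.
# cert_status is a hypothetical helper, not part of certbot or openssl.
# Usage (path is an ASSUMPTION based on certbot's default layout; it is
# usually readable by root only):
#   cert_status /etc/letsencrypt/live/bostonhacks.io/cert.pem

cert_status() {
    # $1: path to a PEM certificate file
    if [ ! -r "$1" ]; then
        echo "unreadable"
        return 2
    fi
    # "-checkend 0" exits 0 only if the certificate has not expired yet.
    # Any other result (expired, or not a parseable certificate) is
    # reported as "not valid".
    if openssl x509 -checkend 0 -noout -in "$1" >/dev/null 2>&1; then
        echo "valid"
    else
        echo "not valid"
        return 1
    fi
}
```

Running "sudo certbot certificates" on the server should show similar information, including the expiry date and the list of domains on each certificate.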
# REPAIR
First, Ken turned off nginx, as it gets in the way of "certbot", the program that renews our certificates, by running "sudo service nginx stop". Next, he went to Cloudflare and turned off the proxy for bostonhacks.io, museo.bostonhacks.io, and www.bostonhacks.io. He then ran "sudo certbot renew --cert-name bostonhacks.io --dry-run" to dry-run a renewal of the certificate. It still returned an error.
Perplexed, he read through the error and found that our certificate still includes "calendar.bostonhacks.io", even though we removed that subdomain from our DNS because we don't need it anymore. A quick online search turned up no simple way to remove a subdomain from a certificate, at least not one that would fit before Ken's 12:30 class, so he just added the calendar subdomain back to Cloudflare, and we'll deal with it in the near future.
Finally, the dry run passed, so Ken ran the command again without the "--dry-run" flag, and the certificates renewed successfully. He then started nginx again with "sudo service nginx start".
# FUTURE WORK
We have to remove calendar.bostonhacks.io from our certificate and make some changes to the Let's Encrypt setup on our server so that renewal works without the calendar subdomain.
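One possible way to do this, untested on our server, is to reissue the certificate under the same name with an explicit domain list that leaves calendar out. The sketch below only prints the command so it can be reviewed first. The domain list is an assumption (only the three names mentioned in this post mortem), "--standalone" is an assumption about how our certs are issued, and "--dry-run" is left in so nothing is saved until it is removed. The function name build_reissue_cmd is made up for illustration.

```shell
#!/bin/sh
# Sketch: build (but do not run) a certbot command that reissues the
# certificate without calendar.bostonhacks.io.
CERT_NAME="bostonhacks.io"
# ASSUMPTION: these are all the names we still want on the certificate.
DOMAINS="bostonhacks.io,www.bostonhacks.io,museo.bostonhacks.io"

build_reissue_cmd() {
    # ASSUMPTION: --standalone matches how the cert was originally issued,
    # which means nginx must be stopped while this runs.
    echo "sudo certbot certonly --standalone --dry-run --cert-name $CERT_NAME -d $DOMAINS"
}

build_reissue_cmd
```

If the printed command passes as a dry run, running it again without "--dry-run" should replace the certificate, after which calendar can be removed from Cloudflare again.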
# MITIGATION
There is now a script on the server at /home/kgarber/certbot-renew-script.sh, which should stop nginx, renew the certs, and start nginx again. *** NOTE *** I think you need to manually turn off the Cloudflare proxy for it to work, but I'm not sure. In Cloudflare, under the DNS tab, just click the orange cloud to turn off the proxy but keep DNS on.
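In case the script goes missing, here is a rough sketch of what a script like that could look like. It is not the actual contents of certbot-renew-script.sh. It uses only the three commands from the REPAIR section, and it just prints them unless APPLY=1 is set, so it is safe to try. The names run, renew_cert, and APPLY are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a renewal script; NOT the real
# /home/kgarber/certbot-renew-script.sh.
# By default it only prints each command. Set APPLY=1 to execute them.

CERT_NAME="bostonhacks.io"   # certificate name used in the REPAIR section

# Print the command, and execute it only when APPLY=1.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

renew_cert() {
    # Stop nginx first, since it gets in certbot's way.
    run sudo service nginx stop || return 1
    # Renew, remembering the result so nginx is restarted even on failure.
    status=0
    run sudo certbot renew --cert-name "$CERT_NAME" || status=$?
    run sudo service nginx start
    # Report the renewal's exit status to the caller.
    return "$status"
}

renew_cert
```

Restarting nginx even when the renewal fails is a deliberate choice here: a failed renewal should not also leave the web server down.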
# CREDITS
Charles is commended for his quick response and wit. His actions led to a very short outage, and all of the BH members and users appreciate his dedication. Also thanks to Andrew because this stuff is tough to set up and he made it easy, with certbot and nginx configs and whatnot.
# ADDENDUM
This post mortem is a joke, but it will be useful once Andrew leaves us and we get the same error. It will live in the Google Drive for now.