Last active
April 10, 2018 15:37
April 10th, 10:30am massive bostonhacks.io outage -- post mortem
# DISCOVERY
In the early morning hours of Tuesday, April 10th, 2018, BostonHacks member Charles Ma discovered that the BostonHacks landing page, bostonhacks.io, was offline. Charles promptly used Facebook Messenger to notify the BostonHacks team (from now on referred to as the BH team). Ken Garber then opened his computer to diagnose and repair the issue.
# DIAGNOSIS
Accessing bostonhacks.io, Ken saw an error message from Cloudflare stating that "The SSL certificate presented by the server did not pass validation...". He realized that SSL certificates from Let's Encrypt are short-lived and that ours had probably just expired.
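For next time, the "probably just expired" guess can be confirmed directly on the server before touching anything. The following is a minimal sketch, not something that was run during the outage; the function name `cert_status` is made up for illustration, and it assumes `openssl` is installed.

```shell
#!/bin/sh
# cert_status (hypothetical helper): print a PEM certificate's expiry date
# and say whether it has already expired.
# Usage: cert_status /path/to/cert.pem
cert_status() {
    cert="$1"
    # -enddate prints a line of the form "notAfter=<date>"
    openssl x509 -noout -enddate -in "$cert" || return 2
    # -checkend 0 exits non-zero if the certificate has already expired
    if openssl x509 -noout -checkend 0 -in "$cert" >/dev/null; then
        echo "status: valid"
    else
        echo "status: expired"
        return 1
    fi
}
```

On our server the certificate would presumably live at /etc/letsencrypt/live/bostonhacks.io/cert.pem. That path is an assumption based on certbot's default layout and the "bostonhacks.io" cert name, and reading it usually requires root.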
# REPAIR
First, Ken turned off nginx, as it gets in the way of "certbot", the program which renews our certificates. He ran the "sudo service nginx stop" command. Next, he logged in to Cloudflare and turned off the proxy for bostonhacks.io, museo.bostonhacks.io, and www.bostonhacks.io. He then ran "sudo certbot renew --cert-name bostonhacks.io --dry-run" to dry-run a renewal of the certificate. It still returned an error.
Perplexed, he read through the error and found that our certificate still includes "calendar.bostonhacks.io", even though we removed that subdomain from our DNS because we don't need it anymore. A quick online search didn't turn up a simple way to remove a subdomain from a certificate before Ken's 12:30 class, so he just added the calendar subdomain back to Cloudflare, and we'll deal with it in the near future.
Finally, the dry run passed, so Ken ran the command again without the "--dry-run" flag, and the certificates renewed successfully. He then started nginx again with "sudo service nginx start".
# FUTURE WORK
We have to remove calendar.bostonhacks.io from our certificate and make some changes to the Let's Encrypt setup on our server so that renewal works without the calendar subdomain.
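One route that may work, and that we have not tried, is to re-issue the certificate under the same name with an explicit domain list that leaves calendar out, using certbot's "certonly" subcommand with "--cert-name" and "-d". The sketch below only prints the command rather than running it. The domain list is an assumption (run "sudo certbot certificates" to see what is really on the cert), "--standalone" is assumed because we stop nginx before renewing, and `reissue_command` is a made-up helper name.

```shell
#!/bin/sh
# Build (and only print) a certbot command that would re-issue the
# certificate without one domain. Nothing here contacts Let's Encrypt.

# ASSUMPTION: the domains currently on the certificate. Verify first.
DOMAINS="bostonhacks.io www.bostonhacks.io museo.bostonhacks.io calendar.bostonhacks.io"
DROP="calendar.bostonhacks.io"

reissue_command() {
    keep=""
    for d in $DOMAINS; do
        # Skip the domain we want to remove
        [ "$d" = "$DROP" ] && continue
        # Append to a comma-separated list
        keep="${keep:+$keep,}$d"
    done
    # --dry-run is kept so the first real attempt changes nothing
    echo "sudo certbot certonly --standalone --cert-name bostonhacks.io -d $keep --dry-run"
}

reissue_command
```

If the printed command passes as a dry run on the server (with nginx stopped), running it again without "--dry-run" should replace the certificate's domain list.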
# MITIGATION
There is now a script on the server at /home/kgarber/certbot-renew-script.sh, which should stop nginx, renew the certs, and start nginx again. *** NOTE *** You may need to manually turn off the Cloudflare proxy for it to work (I think so, but I'm not sure). In Cloudflare, under the DNS tab, just click the orange cloud to turn off the proxy while keeping DNS on.
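For anyone who has to rebuild that script, here is a sketch of what a stop, renew, start wrapper could look like. It is not a copy of the script on the server. The `run` and `renew_cert` helpers and the `APPLY` and `CERT_NAME` variables are made up for illustration, and by default it only prints the commands so it is safe to try anywhere.

```shell
#!/bin/sh
# Sketch of a renewal wrapper: stop nginx, renew, then start nginx again.
# Prints the commands by default; set APPLY=1 to actually run them.

CERT_NAME="${CERT_NAME:-bostonhacks.io}"

# Run a command for real only when APPLY=1, otherwise just print it
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

renew_cert() {
    # Give up early if nginx cannot be stopped
    run sudo service nginx stop || return 1
    # Remember the renewal result so nginx is started again either way
    run sudo certbot renew --cert-name "$CERT_NAME"
    status=$?
    run sudo service nginx start
    return "$status"
}

renew_cert
```

Starting nginx again even when the renewal fails is deliberate: a failed renewal should not also leave the site without a web server. The Cloudflare proxy toggle is not handled here and would still be a manual step if it turns out to be needed.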
# CREDITS
Charles is commended for his quick response and wit. His actions led to a very short outage, and all of the BH members and users appreciate his dedication. Also, thanks to Andrew, because this stuff is tough to set up and he made it easy, with certbot and nginx configs and whatnot.
# ADDENDUM
This post mortem is a joke, but it will be useful once Andrew leaves us and we get the same error. It will live in the Google Drive for now.