@jtimberman
Last active August 29, 2015 14:02
This is the public copy of the postmortem document for the http->https redirect issue on supermarket.getchef.com.

2014-06-18 - Supermarket HTTP Redirect - Non-production

  1. This is a blameless Post Mortem.
  2. We will not focus on the past events as they pertain to "could've" "should've"...
  3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don't make it a follow up item.

Incident Leader: Joshua Timberman

Description

HTTP redirect for supermarket.getchef.com was loading a default "Welcome to Nginx" page instead of properly redirecting to the Supermarket Rails application.

Timeline

Times in UTC on June 18, 2014.

  • 17:55 - Production deployment begins to redirect HTTP -> HTTPS for supermarket.
  • 18:00 - http://supermarket.getchef.com redirects to https://supermarket.getchef.com
  • 18:02 - The external monitoring system notifies on-call that supermarket is down. On-call is aware that the deployment is happening, but not of the cause of the outage.
  • 18:02 - Chef operations begins investigating configuration, which was used and tested on the staging site (http://supermarket-staging.getchef.com).
  • 18:02-18:18 - Chef operations discusses internally how the redirect works, to confirm that it should be doing the right thing.
  • 18:18 - nginx configuration verified by using an SSH tunnel w/ forwarded ports (443->8443, localhost:8443 working, localhost:8080 redirected to https://supermarket.getchef.com).
  • 18:20 - A discrepancy in the production ELB used for supermarket was found, where the listener for HTTPS was pointing to port 80 on the instances instead of port 443. It's unknown why this led to an nginx welcome page.
  • 18:21 - Service restored, http to https redirect is working, and users can browse supermarket using https, sign out/in, and the knife supermarket plugin is functional.

The supermarket application was down for approximately 20 minutes. This is considered non-impacting, because the supermarket site hasn't fully replaced the existing Community Site yet.

Root Cause

The supermarket application runs on three AWS instances. It is a Rails app run under the Unicorn HTTP server, which listens on a Unix domain socket on each of the instances. Nginx is used as a local reverse proxy on each of the application servers. Prior to the change that was deployed, nginx only listened on port 80, and the ELB listener for port 443 forwarded to instance port 80. In order to enforce HTTPS, nginx had to be configured to redirect HTTP requests to HTTPS on port 443. For background, see http://frankmitchell.org/2013/05/https-elb.

Prior to the change, users could connect to supermarket via https directly, using https://supermarket.getchef.com. The point of the change was to ensure that HTTP is redirected to HTTPS. Operations had a detailed deployment plan. The root cause of the issue is that the listener for HTTPS in the ELB was pointed at the instances on port 80, instead of on 443, so the proper redirection wasn't happening.

Nginx configuration snippet from the cookbook:

/etc/nginx/sites-enabled/default

server {
  listen 80;
  # several proxy settings...
  location / {
    # Redirect unless the X-Forwarded-Proto request header says the client used HTTPS.
    if ($http_x_forwarded_proto != 'https') {
      return 301 https://$server_name$request_uri;
    }
  }
}

server {
  listen 443;
  # other settings...
}

It isn't clear why the default Nginx welcome page was displayed.
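The redirect rule in the port 80 server block can be modeled in a few lines of Python. This is only a sketch of the decision nginx makes, not part of the cookbook; the function name `redirect_location` and its parameters are hypothetical, and the server name is passed in because the snippet does not show a `server_name` directive.

```python
from typing import Optional


def redirect_location(forwarded_proto: Optional[str],
                      server_name: str,
                      request_uri: str) -> Optional[str]:
    """Return the Location for the 301, or None when no redirect is issued.

    Mirrors: if ($http_x_forwarded_proto != 'https') { return 301 ...; }
    """
    # A missing X-Forwarded-Proto header is treated as an empty string,
    # which also differs from 'https' and therefore triggers the redirect.
    if (forwarded_proto or "") != "https":
        return f"https://{server_name}{request_uri}"
    return None
```

A request that arrived at the load balancer over plain HTTP gets a redirect target, while one that arrived over HTTPS is left alone by this rule.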

Stabilization Steps

The production ELB listeners were reconfigured so that Load Balancer Protocol HTTPS on load balancer port 443 points to instance port 443, instead of instance port 80.
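A check along the following lines could flag this kind of listener mismatch before a deployment. It is a minimal sketch that works on plain dictionaries; the field names (`Protocol`, `LoadBalancerPort`, `InstancePort`) are modeled on how classic ELB listeners are usually described, the helper name `misrouted_https_listeners` is hypothetical, and the HTTP listener in the sample data is an assumption, since only the HTTPS listener is described above.

```python
from typing import Dict, List


def misrouted_https_listeners(listeners: List[Dict],
                              expected_instance_port: int = 443) -> List[Dict]:
    """Return HTTPS listeners that do not forward to the expected instance port."""
    return [
        listener for listener in listeners
        if listener["Protocol"].upper() == "HTTPS"
        and listener["InstancePort"] != expected_instance_port
    ]


# Listener layout during the incident (HTTP entry is assumed for illustration).
before_fix = [
    {"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80},
    {"Protocol": "HTTPS", "LoadBalancerPort": 443, "InstancePort": 80},
]

# Listener layout after the stabilization step.
after_fix = [
    {"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80},
    {"Protocol": "HTTPS", "LoadBalancerPort": 443, "InstancePort": 443},
]
```

Running the helper on `before_fix` returns the single HTTPS listener that pointed at instance port 80; on `after_fix` it returns an empty list. The same comparison could be run against staging and production to spot differences between the two.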

Impact

The production supermarket app was down for about 20 minutes. This had no customer impact, as it isn't a service directly used by Hosted Chef, and is still in a "beta" soft-launch phase.

Corrective Actions

Action items going forward to fix the issue and root cause. This should include owners/teams assigned to these actions to see them through.

  • Done: ELB listener for HTTPS pointed at port 443 on instances instead of port 80
  • Staging configuration should be the same as production - Chef Operations & Community team
  • Update Chef's AWS cookbook for better ELB + instance pool integration - Chef Community
  • External nagios check needs to follow 301 and check HTTPS - Chef Operations
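The last action item, a check that follows the 301 and verifies the HTTPS endpoint, could look roughly like the sketch below. It is not the actual nagios check: the function name `check_redirect_to_https` is hypothetical, and the HTTP request itself is abstracted behind a `fetch` callable (assumed to return the status code and `Location` header without following redirects) so that the logic can be exercised without network access.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin, urlsplit

# fetch(url) -> (status_code, Location header or None); must not follow redirects.
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def check_redirect_to_https(url: str, fetch: Fetch,
                            max_hops: int = 5) -> Tuple[bool, str]:
    """Follow redirects from url and require a 200 response served over HTTPS."""
    current = url
    for _ in range(max_hops):
        status, location = fetch(current)
        if status in (301, 302, 307, 308) and location:
            # Resolve relative Location values against the current URL.
            current = urljoin(current, location)
            continue
        ok = status == 200 and urlsplit(current).scheme == "https"
        return ok, f"{status} at {current}"
    return False, f"too many redirects, last URL {current}"
```

With this shape, a 200 served directly over HTTP (for example a default welcome page) fails the check, because the final URL is not HTTPS, whereas a check that only looks at the status code of the first response would pass.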