@evanphx
Created October 3, 2012 21:55
Rubygems.org outage - Post Mortem

Context

The rubygems.org infrastructure has been under increasingly heavy load, and the need to move to a setup that handles that load better was well established.

Migration to new setup

On October 3rd, I opted to begin the migration to a setup based on RackSpace Cloud that I had been working on for the last few weeks. This setup uses separate instances for the load balancers, application servers, and database, and is designed to deal better with the load that rubygems.org generates.

At 13:56 PT, after coordinating with the other developers in #rubygems on irc.freenode.net, we (the rubygems.org team) were all happy with the new setup and began the transition.

Almost immediately, we noticed that the app servers were experiencing packet loss to the database servers. I contacted RackSpace and was told that they were experiencing a network configuration issue with the VMs.

Additionally, users began to report missing gems in the rubygems.org UI. Upon investigation, we found that the wrong database dump had been loaded (two dumps had been generated about 10 minutes apart).

Switch back

We opted to switch back to the old setup. DNS was adjusted and the old server was brought out of maintenance mode. Some gems had been pushed to the new systems, so we're currently restoring those gems. If a gem disappeared for you and doesn't appear to be restored, you should be able to simply push it again.

Conclusion

We know that the ruby community puts a lot of trust in the rubygems.org infrastructure. I'm very sorry for the service disruption and am going to work to prevent these sorts of issues in the future.

As we circle the wagons and figure out what went wrong, we will work hard to figure out a path to improve rubygems.org that avoids these kinds of issues in the future.

Involvement

Rubygems.org is a volunteer organization funded by RubyCentral to provide a stable rubygems environment for the ruby community. If you would like to be involved, stop by #rubygems on irc.freenode.net or email me (evan@rubycentral.org).

  • Evan Phoenix and the Rubygems.org Team
@databyte

databyte commented Oct 3, 2012

I'm not sure if your current configuration allows for it, but when rolling out a new infrastructure configuration, I echo the requests from the old LB to the new LB and drop the second set of responses. So you're sending browser traffic to both the old and new servers, but you simply ignore the responses from the new servers and let the old servers provide the real responses.

Another common technique is to capture all inbound requests and play them back to the new servers, but I find it doesn't really work out the same way: either the playback isn't timed right or you have to tweak the data. The former option is easier if your network configuration allows for it.
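To make the first approach (echoing requests and dropping the second set of responses) concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how rubygems.org or any particular load balancer does it: the `ShadowingBalancer` class, the `Handler` type, and the idea of modelling each backend as a plain function are all hypothetical. In practice this would more likely live in the load balancer's own configuration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical stand-in for a backend: takes a request path, returns a body.
Handler = Callable[[str], str]


class ShadowingBalancer:
    """Answer every request from the old backend; echo a copy to the new one."""

    def __init__(self, primary: Handler, shadow: Handler, workers: int = 4) -> None:
        self.primary = primary  # old servers: provide the real responses
        self.shadow = shadow    # new servers: responses are thrown away
        self.shadow_errors = 0
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _echo(self, request: str) -> None:
        try:
            self.shadow(request)  # result intentionally discarded
        except Exception:
            # A broken new stack must never reach the user; just count it.
            with self._lock:
                self.shadow_errors += 1

    def handle(self, request: str) -> str:
        # Send the copy in the background so it cannot slow the real response.
        self._pool.submit(self._echo, request)
        return self.primary(request)

    def drain(self) -> None:
        """Wait for all in-flight shadow requests (useful when shutting down)."""
        self._pool.shutdown(wait=True)
```

With something like this in place, the new servers see real production traffic while users only ever receive answers from the old servers, and `shadow_errors` gives a rough signal of how the new setup is coping before any DNS change is made.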

Good luck and keep up the good work!

@myronmarston

Thanks for being transparent about this, Evan!
