@jpetazzo
jpetazzo / README.md
Created April 4, 2012 16:20
Repair a Riak bitcask-based cluster when the ring has gone out of control

So I heard you hosed your Riak cluster

I don't know what you did (I don't know what I did when this happened to me), but you ended up with a completely borked Riak cluster. Possible causes and symptoms include:

  • riak-admin transfers shows different things depending on the node you run it on
  • you tried to leave/join nodes to fix things, but it only made them worse
  • you ran mixed versions in parallel, instead of doing a clean rolling upgrade
  • some data seems to be missing, and when you list the keys in a bucket, there are clearly fewer than you expected
  • YOU'RE AFRAID YOU MIGHT HAVE LOST DATA
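
To make the first symptom concrete, here is a rough sketch (not part of the original procedure) that runs riak-admin transfers on every node and groups the nodes by what they report. The host names, passwordless ssh access, and the idea that a healthy cluster reports the same thing everywhere are all assumptions; treat the result as a hint, not a diagnosis.

```python
import subprocess
from collections import defaultdict


def collect(hosts, command=("riak-admin", "transfers")):
    """Run the command on each host over ssh and return {host: stdout}.

    The hosts and the ssh setup are hypothetical; adapt them to your cluster.
    """
    outputs = {}
    for host in hosts:
        result = subprocess.run(
            ["ssh", host, *command],
            capture_output=True,
            text=True,
            timeout=60,
        )
        outputs[host] = result.stdout
    return outputs


def group_by_output(outputs):
    """Group hosts whose output is identical, ignoring blank lines and padding."""
    groups = defaultdict(list)
    for host, text in sorted(outputs.items()):
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        groups["\n".join(lines)].append(host)
    # One inner list per distinct view of the cluster.
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    views = group_by_output(collect(["riak1", "riak2", "riak3"]))
    if len(views) > 1:
        print("Nodes disagree about transfers:", views)
    else:
        print("All nodes report the same thing.")
```

If more than one group comes back, the nodes do not share the same view of the cluster, which matches the first symptom in the list above.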
@lusis
lusis / opsschool.md
Created October 25, 2012 12:48
What happened here?

Random idea

I'm a big fan of the Ops School idea. I've struggled for years with how to "train up" someone coming into this field. So much of our skill set is forged in the fire of outages and troubleshooting.

One of the things that is both good and bad about system administration and operations is that we tend to see patterns. It's good in that we immediately see things that stand out. The downside is we tend to superimpose that pattern recognition in inappropriate ways.

We had an interesting issue happen yesterday at the day job. I won't go into exactly what it was here, but I had an idea based on some graphs I was looking at. It's part social experiment, part exercise in problem-solving skills.

Given the following image with no context, what do you think happened? What are some of the key indicator points that jump out and what pattern do they call to mind?

_(Since it may not be clear simply from an image resolution perspective, there are 4 me