@jpetazzo
Created April 4, 2012 16:20
Repair a Riak bitcask-based cluster when the ring has gone out of control

So I heard you hosed your Riak cluster

I don't know what you did (I don't know what I did when this happened to me), but you ended up with a completely borked Riak cluster. Possible causes and symptoms include:

  • riak-admin transfers shows different things depending on the node you run it on (see the sketch after this list)
  • you tried to leave/join nodes to fix things, but that only made things worse
  • you ran mixed versions in parallel, instead of doing a clean rolling upgrade
  • some data seems to be missing, and when you list the keys in a bucket, there are clearly fewer than you expected
  • YOU'RE AFRAID YOU MIGHT HAVE LOST DATA
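
If you want to see the first symptom for yourself, here is a minimal sketch that runs riak-admin transfers on every node so you can compare the output. It assumes you can SSH to each node; riak1, riak2 and riak3 are hypothetical hostnames, substitute your own.

    # Hypothetical hostnames; replace with the nodes of your cluster.
    for h in riak1 riak2 riak3; do
        echo "== $h =="
        ssh "$h" riak-admin transfers
    done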

Don't panic—at least not before having tried this.

  1. Install a new server (spin up a VM, whatever...)
  2. Install a brand new, virgin Riak on it
  3. Stop the riak node running on the new server: riak stop
  4. Wipe it out: rm -rf /var/lib/riak/*
  5. Recreate the bitcask directory: mkdir /var/lib/riak/bitcask
  6. Create a directory (e.g. ~/bitcasks) on the new server
  7. Copy the /var/lib/riak/bitcask directory of each node of your borked cluster into ~/bitcasks/node-$HOSTNAME (this $HOSTNAME should be the hostname of the node, not the hostname of your new server; see the sketch after this list)
  8. Copy the merge-bitcask.py file to the same directory
  9. Run it and inspect the output (it should print one cp line per partition, i.e. 64 by default)
  10. Run it again for real: python merge-bitcask.py | sh
  11. Start the Riak node and see if your data is there
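
For steps 6 to 8, here is a rough sketch of what that can look like, assuming the old nodes are reachable over SSH as riak1, riak2 and riak3 (hypothetical names), that rsync is available, and that you can read /var/lib/riak/bitcask on each of them:

    # On the new server; riak1..riak3 are the hostnames of the borked nodes.
    mkdir -p ~/bitcasks
    cd ~/bitcasks
    for h in riak1 riak2 riak3; do
        # trailing slash: copy the contents of bitcask into node-$h
        rsync -a "$h:/var/lib/riak/bitcask/" "node-$h/"
    done
    # then continue with steps 8-10 from this directory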

How does it work?

The bitcask directory contains one subdirectory per partition. Sometimes (at least, that's what happened to me!) the partitions get all messed up, and nodes no longer know which other node owns which partition. The method described here merges all the partitions onto a single new node. In some cases, though, several versions of the same partition will be present on different nodes. The script below simply measures the size of each copy and keeps the biggest one. You can probably do the same thing with a mix of du/sort/awk (a rough sketch of that follows the script).

#!/usr/bin/env python
# merge-bitcask.py
# Emits "cp -r" commands that rebuild a single bitcask directory from the
# node-* copies in the current directory, keeping the biggest copy of each
# partition (vnode). Run it once to inspect the output, then pipe it to sh.
import os
import glob

# One node-* directory per node of the borked cluster.
sourcedirs = glob.glob('node-*')

# Collect every partition (vnode) directory seen on any node.
vnodes = set()
for sourcedir in sourcedirs:
    vnodes |= set(os.listdir(sourcedir))
# Ignore this non-partition directory if it happens to be there.
vnodes.discard('manual_cleanup')

for vnode in vnodes:
    # Find the node holding the biggest copy of this partition.
    biggestsize = 0
    biggestsource = None
    for sourcedir in sourcedirs:
        thissize = 0
        if not os.path.isdir(os.path.join(sourcedir, vnode)):
            continue
        for bcfile in os.listdir(os.path.join(sourcedir, vnode)):
            thissize += os.stat(os.path.join(sourcedir, vnode, bcfile)).st_size
        if thissize > biggestsize:
            biggestsize = thissize
            biggestsource = sourcedir
    if biggestsource is None:
        # Only empty copies of this partition were found; nothing to copy.
        continue
    print('cp -r {biggestsource}/{vnode} /var/lib/riak/bitcask'.format(**locals()))
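
The paragraph above mentions that you could do the same with du/sort/awk; here is a rough, untested sketch of that approach. It skips the manual_cleanup directory, keeps the biggest copy of each partition, and prints the same kind of cp commands (inspect them before piping to sh):

    du -sk node-*/* | grep -v manual_cleanup | sort -n |
      awk '{split($2, a, "/"); best[a[2]] = $2}
           END {for (v in best) print "cp -r", best[v], "/var/lib/riak/bitcask"}'
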
@gburd

gburd commented Apr 5, 2012

I'd like to understand more about what you mean by "all messed up" because that is not something that generally happens. Please email me greg AT basho DOT company. This is not a normal repair process that we've used so I'm not sure what effect it will have. By the way, doing excessive leaves/joins is not a great idea in general. We should talk about your use case and experiences.

@jpetazzo

jpetazzo commented Apr 5, 2012

Hi Greg—sent you an e-mail this morning. We'll be happy to provide as much info as we can.

Note for innocent bystanders: this doesn't mean that Riak is unreliable/broken/buggy/whatever. It means that we did something wrong, and Riak did let us shoot ourselves in the foot. That might sound bad. However, even without knowing Riak internals, we were able to recover, without losing a single bit of data. That's definitely good, IMHO.

@gburd

gburd commented Apr 5, 2012

There's enough blame to go around. :) The thing you "did wrong" was not entirely your fault. Basically, the transfers were still happening and things were just taking a while to work out during a ring rebalance when you issued join/leave and that caused the confusion. You needed more feedback from our product to make better decisions about what to do when administering a cluster.
