LevelDB can become corrupted when bad things happen on the filesystem or in hardware. We push I/O to its limits on heavily loaded Riak nodes, so it is not uncommon to experience such failures. This one shows up as the message `Compaction error: Corruption: corrupted compressed block contents` in the «data_root»/leveldb/«vnode»/LOG file. To find the affected vnodes, search the LOG files for the error:
[root@prod-2163 /var/db/riak/leveldb]# find . -name "LOG" -exec grep -l 'Compaction error' {} \;
./442446784738847563128068650529343492278651453440/LOG
./448155775509671402652301794407141472824182439936/LOG
./145579264656007907867945168883848503911040155648/LOG
2012/03/18-16:45:55.649589 57 Compaction error: Corruption: corrupted compressed block contents
2012/03/18-16:45:55.649643 4b waiting...
2012/03/18-16:45:56.105357 57 Skipping expansion on level 0 from 12 to 12 files
2012/03/18-16:45:56.105418 57 Compacting 12@0 + 5@1 files
2012/03/18-16:45:56.111994 57 Generated table #200557: 162 keys, 174112 bytes
2012/03/18-16:45:56.169928 57 Generated table #200558: 224 keys, 2112499 bytes
2012/03/18-16:45:56.227341 57 Generated table #200559: 239 keys, 2111625 bytes
2012/03/18-16:45:56.285007 57 Generated table #200560: 230 keys, 2108929 bytes
2012/03/18-16:45:56.341888 57 Generated table #200561: 223 keys, 2109107 bytes
2012/03/18-16:45:56.375369 57 Generated table #200562: 116 keys, 1287455 bytes
2012/03/18-16:45:56.429633 57 compacted to: files[ 12 5 54 200 0 0 0 ]
2012/03/18-16:45:56.430168 57 Delete type=2 #200557
2012/03/18-16:45:56.430327 57 Delete type=2 #200559
2012/03/18-16:45:56.430871 57 Delete type=2 #200562
2012/03/18-16:45:56.431242 57 Delete type=2 #200563
2012/03/18-16:45:56.432571 57 Delete type=2 #200558
2012/03/18-16:45:56.433146 57 Delete type=2 #200560
2012/03/18-16:45:56.433723 57 Delete type=2 #200561
2012/03/18-16:45:56.434338 57 Compaction error: Corruption: corrupted compressed block contents
Messages like those in the LOG excerpt above indicate that the LevelDB databases of these vnodes need repair. We can do that, but it is very unusual for more than one database to be corrupt at any given moment. Finding one compaction error is interesting; finding more than one might be a strong indication of a hardware or OS bug.
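Because the number of affected vnodes matters for the diagnosis, it can help to count the errors per vnode rather than only list the files. The following is a minimal sketch, not part of Riak: `count_compaction_errors` is a hypothetical helper name, and it assumes the layout shown above, with one LOG file per vnode directory directly under the data root.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Riak): print "<count> <vnode>" for
# every vnode whose LOG contains at least one compaction error, followed by
# a total. The body runs in a subshell, so its variables stay local.
# Usage: count_compaction_errors /var/db/riak/leveldb
count_compaction_errors() (
    data_dir=$1
    affected=0
    for log in "$data_dir"/*/LOG; do
        [ -f "$log" ] || continue          # glob matched nothing, or not a file
        # grep -c prints the number of matching lines; it exits non-zero
        # when that number is 0, hence the "|| true".
        count=$(grep -c 'Compaction error' "$log" || true)
        if [ "${count:-0}" -gt 0 ]; then
            vnode=$(basename "$(dirname "$log")")
            echo "$count $vnode"
            affected=$((affected + 1))
        fi
    done
    echo "affected vnodes: $affected"
)
```

For example, `count_compaction_errors /var/db/riak/leveldb` prints one `<count> <vnode>` line per affected vnode and then the total.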
1. Start an Erlang session (do not start Riak; we only want Erlang):
/opt/local/riak/erts-5.8.5/bin/erl
2. From the Erlang console, run the following command to set the options for opening the LevelDB database:
[application:set_env(eleveldb, Var, Val) || {Var, Val} <- [{max_open_files, 2000}, {block_size, 1048576}, {cache_size, 20*1024*1024*1024}, {sync, false}, {data_root, "/var/db/riak/leveldb"}]].
3. For each of the corrupted LevelDB databases (found with `find . -name "LOG" -exec grep -l 'Compaction error' {} \;` as shown earlier), run this command, substituting the proper vnode number:
eleveldb:repair("/var/db/riak/leveldb/442446784738847563128068650529343492278651453440", []).
4. When all repairs have finished successfully, you may restart the node:
riak start
5. Check for proper operation by looking at the log files in /var/log/riak and at the LOG files in the affected LevelDB vnodes.
6. Contact us with any concerns.
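When several vnodes are affected, typing the repair call for each one by hand is error-prone. As a minimal sketch, the hypothetical helper `print_repair_calls` below (not part of Riak) prints one `eleveldb:repair/2` call per affected vnode, in the same form as the repair command above, so the output can be pasted into the Erlang console. It assumes one LOG file per vnode directory directly under the data root, and it only prints the calls; it does not run them.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Riak): print an eleveldb:repair/2
# call for every vnode directory whose LOG contains a compaction error.
# The body runs in a subshell, so its variables stay local.
# Usage: print_repair_calls /var/db/riak/leveldb
print_repair_calls() (
    data_dir=$1
    for log in "$data_dir"/*/LOG; do
        [ -f "$log" ] || continue          # glob matched nothing, or not a file
        # grep -q prints nothing and succeeds on the first matching line
        if grep -q 'Compaction error' "$log"; then
            # The vnode directory is the parent directory of its LOG file
            printf 'eleveldb:repair("%s", []).\n' "$(dirname "$log")"
        fi
    done
)
```

Passing the absolute data root, for example `print_repair_calls /var/db/riak/leveldb`, yields absolute paths in the generated calls, matching the repair command above.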
# References and Addenda
## Links to tickets, bugs, pull requests as well as system information
* The example assumes SmartOS; use `riak start` rather than `svcadm enable`
* Seen in Zendesk ticket #1117: https://basho.zendesk.com/tickets/1117