@tsuna
Created November 17, 2011 18:42
Script to get stats on the number of KeyValue and size of an HBase table, directly from HFiles
#!/bin/bash
# Copyright (c) 2010, 2011 Benoit Sigoure. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# - Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# - Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# - Neither the name of the StumbleUpon nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
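# Usage: $0 <table> <family>
fatal() {
  echo >&2 "error: $*"
  exit 1
}
table=$1
family=$2
[[ -n "$table" && -n "$family" ]] || fatal "usage: $0 <table> <family>"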
TMP=${TMP-"/tmp/$USER"}
mkdir -p "$TMP"
cd "$TMP"
dir="/hbase/$table/*/$family"
hadoop fs -ls "$dir" >"$table-$family-hfiles-raw" || {
  fatal "Couldn't list $dir on HDFS"
}
awk '{print $NF}' "$table-$family-hfiles-raw" >"$table-$family-hfiles"
nhfiles=`wc -l <"$table-$family-hfiles"`
i=0
for hfile in `<"$table-$family-hfiles"`; do
  echo >&2 -n -e "\rReading HFile meta data: $i/$nhfiles"
  hbase org.apache.hadoop.hbase.io.hfile.HFile -f "$hfile" -m 2>/dev/null || {
    fatal "couldn't read HFile metadata for $hfile (rv=$?)"
  }
  i=$((i+1))
done >"$table-$family-metadata"
echo >&2 -n -e "\r \r"
grep -E 'length=|entryCount=' "$table-$family-metadata" >"$table-$family-stats"
if [[ "$i" -ne "$nhfiles" ]]; then
  echo >&2 "warning: collected meta data from only $i out of $nhfiles HFiles"
fi
nkvs=`sed -n 's/.*entryCount=\\([0-9]*\\), .*/\\1+\\\\/p' "$table-$family-stats" | awk '{print}END{print 0}' | bc`
totalbytes=`sed -n 's/.*totalBytes=\\([0-9]*\\), .*/\\1+\\\\/p' "$table-$family-stats" | awk '{print}END{print 0}' | bc`
actualbytes=`sed -n 's/.*length=\\([0-9]*\\)$/\\1+\\\\/p' "$table-$family-stats" | awk '{print}END{print 0}' | bc`
avgkeylen=`sed -n 's/.*avgKeyLen=\\([0-9]*\\), .*/\\1+\\\\/p' "$table-$family-stats" | awk 'BEGIN{print "scale=1;(\\\\"}{print}END{print "0)/'$i'"}' | bc`
avgvallen=`sed -n 's/.*avgValueLen=\\([0-9]*\\), .*/\\1+\\\\/p' "$table-$family-stats" | awk 'BEGIN{print "scale=1;(\\\\"}{print}END{print "0)/'$i'"}' | bc`
echo "Number of HFiles: $i"
echo "Number of KeyValues: $nkvs"
echo "Total size: $totalbytes (`echo "scale=1;$totalbytes/1024/1024/1024" | bc`GB)"
echo "Actual size: $actualbytes (`echo "scale=1;$actualbytes/1024/1024/1024" | bc`GB)"
echo "Compression: `echo "scale=2;$totalbytes/$actualbytes" | bc`x"
echo "Average key length: $avgkeylen bytes"
echo "Average value length: $avgvallen bytes"
echo "Average size of a KV: `echo "scale=1;$totalbytes/$nkvs" | bc` bytes (`echo "scale=1;$actualbytes/$nkvs" | bc` compressed)"

tsuna commented Nov 17, 2011

Sample output:

Number of HFiles:     2405
Number of KeyValues:  133970093902
Total size:           8105337861383 (7548.6GB)
Actual size:          1526938321330 (1422.0GB)
Compression:          5.30x
Average key length:   39.4 bytes
Average value length: 7.5 bytes
Average size of a KV: 60.5 bytes (11.3 compressed)
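
The derived figures above follow from the three raw totals (KeyValues, total size, actual size). As a cross-check, here is a minimal sketch that recomputes them with bash integer arithmetic rather than the script's `bc` calls. The helper name `ratio` is made up for illustration and is not part of the script. It truncates once per ratio, whereas `bc` truncates after each chained division, so the last digit could differ for other inputs.

```shell
#!/bin/bash
# Raw totals copied from the sample output.
nkvs=133970093902
totalbytes=8105337861383
actualbytes=1526938321330

# ratio NUM DEN DIGITS: print NUM/DEN truncated (not rounded) to DIGITS
# decimal places. Hypothetical helper, not part of the original script.
ratio() {
  local num=$1 den=$2 digits=$3
  local scale=$((10 ** digits))
  local q=$((num * scale / den))
  printf '%d.%0*d\n' "$((q / scale))" "$digits" "$((q % scale))"
}

gib=$((1024 * 1024 * 1024))
echo "Total size:  $(ratio "$totalbytes" "$gib" 1)GB"
echo "Actual size: $(ratio "$actualbytes" "$gib" 1)GB"
echo "Compression: $(ratio "$totalbytes" "$actualbytes" 2)x"
echo "Average size of a KV: $(ratio "$totalbytes" "$nkvs" 1) bytes ($(ratio "$actualbytes" "$nkvs" 1) compressed)"
```

The average key and value lengths cannot be rechecked this way, because the script averages the per-HFile values and those are not in the sample output.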

@ozbrancov

Can you please add a usage example to this script?
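
A minimal usage sketch, reading only from the script itself: it takes the table name as `$1` and the column family as `$2`, lists `/hbase/<table>/*/<family>` on HDFS, and writes its intermediate files under `$TMP` (default `/tmp/$USER`). The file name `hfile-stats.sh`, the wrapper `run_hfile_stats`, and the table and family names below are placeholders, not anything the gist defines.

```shell
#!/bin/bash
# Assumed file name for the gist's script; override with SCRIPT=... if needed.
SCRIPT=${SCRIPT:-./hfile-stats.sh}

# Hypothetical wrapper: validate the arguments and the required tools,
# then hand off to the script.
run_hfile_stats() {
  if [ "$#" -ne 2 ]; then
    echo "usage: run_hfile_stats <table> <column-family>" >&2
    return 2
  fi
  # The script shells out to both `hadoop` and `hbase`, so check for them first.
  local tool
  for tool in hadoop hbase; do
    command -v "$tool" >/dev/null 2>&1 || {
      echo "error: $tool not found in PATH" >&2
      return 1
    }
  done
  bash "$SCRIPT" "$1" "$2"
}

# Example with placeholder names:
#   run_hfile_stats mytable myfamily
```

Calling the script directly works the same way: `bash hfile-stats.sh mytable myfamily`, optionally prefixed with `TMP=/some/dir` to change where the intermediate files go.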
