@thattommyhall
Created May 16, 2011 16:35
Get filecount, total size, average filesize for Hive tables
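# Report file count, total size and average file size for each Hive table
# by walking a recursive listing of the warehouse directory.
current = ''
file_count = 0
total_size = 0

File.open('output.csv', 'w') do |output|
  # Print one CSV row (table,file_count,total_size,average_size) and save it.
  report = lambda do
    next if current == ''
    average_size = file_count == 0 ? 0 : total_size / file_count
    result = "#{current},#{file_count},#{total_size},#{average_size}"
    puts result
    output.puts result
  end

  IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
    # permissions,replication,user,group,size,mod_date,mod_time,path
    split = line.split(/\s+/)
    next unless split.size == 8
    permissions = split[0]
    size = split[4]
    path = split[7]
    tablename = path.split('/')[4]
    if tablename != current
      # The listing moved on to a new table: report the previous one and reset.
      report.call
      current = tablename
      file_count = 0
      total_size = 0
    end
    # Directory entries are not counted as files.
    file_count += 1 unless permissions.start_with?('d')
    total_size += size.to_i
  end

  # Report the last table, which the loop never flushes on its own.
  report.call
end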
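
The per-line parsing and per-table aggregation can also be pulled into small functions that can be exercised without a Hadoop cluster. The following is only a sketch: `parse_lsr_line` and `summarize_tables` are hypothetical names that are not part of the original script, and it assumes the 8-column `hadoop fs -lsr` layout described in the comment above, with paths of the form /user/hive/warehouse/<table>/...

```ruby
# Hypothetical helper: parse one line of `hadoop fs -lsr` output.
# Assumed columns: permissions, replication, user, group, size,
# mod_date, mod_time, path. Returns nil for lines that do not match.
def parse_lsr_line(line)
  fields = line.split(/\s+/)
  return nil unless fields.size == 8
  path = fields[7]
  {
    :directory => fields[0].start_with?('d'),
    :size      => fields[4].to_i,
    :table     => path.split('/')[4], # /user/hive/warehouse/<table>/...
    :path      => path
  }
end

# Hypothetical helper: collect file count and total size per table
# from any enumerable of listing lines. Directories are skipped.
def summarize_tables(lines)
  stats = Hash.new { |h, k| h[k] = { :file_count => 0, :total_size => 0 } }
  lines.each do |line|
    entry = parse_lsr_line(line)
    next if entry.nil? || entry[:table].nil? || entry[:directory]
    stats[entry[:table]][:file_count] += 1
    stats[entry[:table]][:total_size] += entry[:size]
  end
  stats
end
```

Because the totals are keyed by table name, this variant does not depend on the listing being grouped by table, unlike the streaming loop above. It could be fed from the same command, for example `summarize_tables(IO.popen('hadoop fs -lsr /user/hive/warehouse'))`.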