
Compress files which are already on HDFS

This hacky method processes 1 file at a time:

  1. copy to a local disk
  2. compress
  3. put back onto HDFS
  4. delete original file from HDFS and compressed file from local disk.

BE CAREFUL: Before executing, inspect the size of each file!

  • The risk: a single large file could fill the local disk, or leave the server compressing for hours.
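
For example, a quick way to eyeball file sizes up front (this uses the Ranger YARN audit path from the example below; substitute your own):

hdfs dfs -du -h /ranger/audit/yarn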

How

  1. (optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.

  2. (optional) Become the hdfs user and kinit. Any user that can access the files will work.

sudo -u hdfs -i

keytab=/etc/security/keytabs/hdfs.headless.keytab
kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
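
An optional quick check that the ticket and HDFS access work before continuing (the path is the example audit path used below):

klist
hdfs dfs -ls /ranger/audit/yarn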
  3. Change to a local partition that is big enough to hold 1-2 of the uncompressed files, for example:

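The directory below is only an example; use any local path with enough free space:

df -h /hadoop/tmp     # check free space first
cd /hadoop/tmp
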
  4. Get the list of files to compress (this example collects Ranger YARN audit logs, skipping today's and yesterday's):

files=$(hdfs dfs -find /ranger/audit/yarn | grep -Ev "($(date '+%Y%m%d')|$(date -d yesterday +'%Y%m%d'))" | grep '\.log$')
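
An optional sanity check before touching anything, using the variable set above:

echo "${files}" | wc -l      # how many files will be processed
hdfs dfs -du -h ${files}     # and how big each one is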
  5. Compress each file, then remove the uncompressed original from HDFS:
for file in ${files}; do
  filename="$(basename "${file}")"
  filedir="$(dirname "${file}")"
  # copy to local disk, compress, move back to HDFS (moveFromLocal removes the local .gz),
  # confirm the .gz exists on HDFS, and only then delete the uncompressed original
  hdfs dfs -copyToLocal "${file}" &&
  gzip "${filename}" &&
  hdfs dfs -moveFromLocal "${filename}.gz" "${filedir}/" &&
  hdfs dfs -stat "${file}.gz" &&
  hdfs dfs -rm -skipTrash "${file}"
done
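
Afterwards you can verify that no uncompressed .log files remain (reusing the example audit path):

hdfs dfs -find /ranger/audit/yarn -name "*.log"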