
Compress files which are already on HDFS

This hacky method processes 1 file at a time:

  1. copy to a local disk
  2. compress
  3. put back onto HDFS
  4. delete original file from HDFS and compressed file from local disk.

BE CAREFUL: Before executing, inspect the size of each file!

  • The risk: a single large file could fill the local disk, or leave the server compressing for hours.
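
For example, a quick way to eyeball file sizes up front (this uses the Ranger YARN audit path from the example below; substitute your own):

hdfs dfs -du -h /ranger/audit/yarn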

How

  1. (optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.

  2. (optional) Become the hdfs user and kinit. Any user that can access the files will work.

sudo -u hdfs -i

keytab=/etc/security/keytabs/hdfs.headless.keytab
kinit -kt ${keytab} $(klist -kt ${keytab}| awk '{print $NF}'|tail -1)
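
An optional quick check that the ticket and HDFS access work before continuing (the path is the example audit path used below):

klist
hdfs dfs -ls /ranger/audit/yarn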
  3. Change to a local partition that is big enough to hold 1-2 of the uncompressed files, for example:

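The directory below is only an example; use any local path with enough free space:

df -h /hadoop/tmp     # check free space first
cd /hadoop/tmp
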
  4. Get the list of files to compress (this example collects Ranger YARN audit logs, skipping today's and yesterday's):

files=$(hdfs dfs -find /ranger/audit/yarn | grep -Ev "($(date '+%Y%m%d')|$(date -d yesterday +'%Y%m%d'))" | grep '\.log$')
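
An optional sanity check before touching anything, using the variable set above:

echo "${files}" | wc -l      # how many files will be processed
hdfs dfs -du -h ${files}     # and how big each one is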
  5. Compress each file, then remove the uncompressed original from HDFS:
for file in ${files}; do
  filename="$(basename "${file}")"
  filedir="$(dirname "${file}")"
  # copy to local disk, compress, move back to HDFS (moveFromLocal removes the local .gz),
  # confirm the .gz exists on HDFS, and only then delete the uncompressed original
  hdfs dfs -copyToLocal "${file}" &&
  gzip "${filename}" &&
  hdfs dfs -moveFromLocal "${filename}.gz" "${filedir}/" &&
  hdfs dfs -stat "${file}.gz" &&
  hdfs dfs -rm -skipTrash "${file}"
done
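
Afterwards you can verify that no uncompressed .log files remain (reusing the example audit path):

hdfs dfs -find /ranger/audit/yarn -name "*.log"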