This hacky method processes one file at a time:
- copy it to a local disk
- compress it
- put it back onto HDFS
- delete the original file from HDFS and the compressed file from the local disk.
BE CAREFUL: Before executing, inspect the size of each file! (A quick way to check is shown below.)
- The risk: a single large file could fill the local disk, or could leave the server compressing one file for hours.
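For example, this lists every file under the path used later in this example along with its size (5th column), largest last, so oversized files are easy to spot; adjust the path to whatever you plan to compress:
hdfs dfs -ls -R /ranger/audit/yarn | sort -n -k5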
-
(optional) SSH to a data node. Running from a data node will make it faster, but it isn't required.
-
(optional) Become the hdfs user and kinit. Any user that can access the files will work.
sudo -u hdfs -i
keytab=/etc/security/keytabs/hdfs.headless.keytab
# kinit as the principal stored in the headless keytab (last field of the klist -kt output)
kinit -kt ${keytab} $(klist -kt ${keytab} | awk '{print $NF}' | tail -1)
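A quick sanity check that the ticket works (the path is just the example audit location used below):
klist
hdfs dfs -ls /ranger/audit/yarn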
-
Change to a partition that is big enough to hold 1-2 of the uncompressed files:
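For example (the directory is only a suggestion; pick whichever local filesystem df shows has enough room):
df -h
cd /var/tmp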
-
Get the list of files (this example targets Ranger YARN audit logs and excludes files dated today or yesterday)
files=$(hdfs dfs -find /ranger/audit/yarn | grep -Ev "($(date '+%Y%m%d')|$(date -d yesterday +'%Y%m%d'))" | grep '\.log$')
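Before looping, it's worth a quick look at what matched:
echo "${files}" | wc -l   # how many files will be processed
echo "${files}" | head    # spot-check a few paths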
- Compress each file and remove the uncompressed original
for file in ${files}; do
  filename="$(basename "${file}")"
  filedir="$(dirname "${file}")"
  # copy the file from HDFS into the current local directory
  hdfs dfs -copyToLocal "${file}" &&
  # compress the local copy (produces filename.gz)
  gzip "${filename}" &&
  # move the compressed copy back to the original HDFS directory (also removes it locally)
  hdfs dfs -moveFromLocal "${filename}.gz" "${filedir}/" &&
  # only delete the original after confirming the .gz landed on HDFS
  hdfs dfs -stat "${file}.gz" &&
  hdfs dfs -rm -skipTrash "${file}"
done
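When the loop finishes, re-running the same find should report zero remaining uncompressed logs (other than the excluded current days):
hdfs dfs -find /ranger/audit/yarn | grep -Ev "($(date '+%Y%m%d')|$(date -d yesterday +'%Y%m%d'))" | grep -c '\.log$'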