De-duplicate using APFS clonefile(2) and jdupes in zsh
#!/usr/bin/env zsh
#
# # About
#
# Since APFS supports block-level de-duplication (clones), it can be useful
# to manually de-duplicate your files if you've migrated/upgraded to APFS
# rather than starting from a fresh install.
#
# I've written this simple script with the aim to:
# - Be simple, easy to read and understand (for users to check)
# - Use native cp -c for de-duplication (for robustness)
# - Use byte-wise file comparison instead of hashing (while rare, hash collisions are possible)
# - Use jdupes for speed
# - Preserve file metadata via GNU cp
#
# See also this stackexchange thread: https://apple.stackexchange.com/questions/316435/replace-existing-duplicate-files-on-apfs-with-clones
#
# # Known bugs
#
# - Does not preserve target directory timestamps
# - Does not preserve xattrs if larger than 128kb (https://apple.stackexchange.com/questions/226485/copy-extended-attributes-to-new-file-ffmpeg)
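#   Tip: on macOS, `ls -l@ somefile` lists each xattr with its size, which
#   helps spot attributes above the 128kb limit before running this script.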
#
# # Background info
#
# https://developer.apple.com/documentation/foundation/file_system/about_apple_file_system
# https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf
# https://eclecticlight.co/2019/01/05/aliases-hard-links-symlinks-and-copies-in-mojaves-apfs/
# https://eclecticlight.co/2017/11/02/taking-stock-using-apfs-in-high-sierra-10-13-1/
#
# # Alternatives (https://apple.stackexchange.com/questions/316435/replace-existing-duplicate-files-on-apfs-with-clones)
#
# Python, uses hashes (collision risk): https://github.com/ranvel/clonefile-dedup
# Python, uses hashes (collision risk, does not preserve metadata?): https://bitbucket.org/dchevell/apfs-deduplicate/src/master/
# C, checks for duplication, does not de-duplicate: https://github.com/dyorgio/apfs-clone-checker
# Does not preserve metadata: https://github.com/deckarep/apfs-compactor
# Paid: http://diskdedupe.com/
# Paid: https://macpaw.com/gemini
### Init: identify files and programs
# File to hold duplicate file data
DUPEFILE=./jdupes-output
# File to temporarily store old file for metadata
TEMPFILE=./tmp-preserved-for-metadata
# Determine which method(s) to use for metadata/xattr preservation: gcp
# preserves most metadata (but not xattrs, see comments below), xattr
# preserves xattrs up to 128kb per attribute but is significantly slower
ATTRUSEGCP=1
ATTRUSEXATTR=1
# Critical programs to use
PCP=/bin/cp # Should be Mac native cp supporting clonefile(2)!
PMV=/bin/mv
PGCP=/opt/local/bin/gcp # Should be GNU cp, installed via MacPorts -- not to be confused with an alias for git cherry-pick
PJDUPES=/opt/local/bin/jdupes
test ! -x "${PCP}" && echo "Error: path to cp wrong" && exit
test ! -x "${PMV}" && echo "Error: path to mv wrong" && exit
test ! -x "${PGCP}" && echo "Error: path to gnu-cp wrong" && exit
test ! -x "${PJDUPES}" && echo "Error: path to jdupes wrong" && exit
### Optional: check how much data can be saved
${PJDUPES} --recurse --omitfirst ./ | tee ${DUPEFILE}
# Loop over lines, if line is not empty, check size, sum in awk
cat ${DUPEFILE} | while read thisfile; do
test -n "${thisfile}" && du -k "${thisfile}"
done | awk '{i+=$1} END {print i" kb"}'
### Find duplicates
# Find duplicates, use NUL character to separate to allow for newlines in
# filenames (rare but possible).
${PJDUPES} --print-null --recurse ./ > ${DUPEFILE}
# Check the number of sets of duplicates by counting occurrences of two
# consecutive NUL characters.
# Source: https://stackoverflow.com/questions/371115/count-all-occurrences-of-a-string-in-lots-of-files-with-grep
NPAIRS=$(grep -oaE '\x00\x00' ${DUPEFILE} | wc -l)
echo "Found ${NPAIRS} sets of duplicates"
### Start de-duplication
# Loop over files separated by NUL characters, use first file of paired
# filenames as source for all other files in this set, e.g.
#
# file1\x00
# file2\x00
# file3\x00\x00
#
# will cause file2 and file3 to be overwritten by file1
#
# - If the filename read is empty (i.e. we hit a double NUL), a new set
#   begins and we unset SOURCEFILE. This also holds for the first set, as
#   SOURCEFILE starts out unset
# - If SOURCEFILE is unset, use the current file to set this
# - If the file is not empty AND SOURCEFILE is set, make a copy:
# -- Move the target file to a new temporary location
# -- Clone the source file over the target file
# -- Copy attributes from source file to target file
SOURCEFILE=""
cat ${DUPEFILE} | while read -d $'\0' FILE; do
if [[ -z $FILE ]]; then
SOURCEFILE=""
elif [[ -z $SOURCEFILE ]]; then
SOURCEFILE=${FILE}
else
# Preserve original file for metadata
${PMV} "${FILE}" "${TEMPFILE}";
# Test that move was successful
test ! -e "${TEMPFILE}" && echo "Error: move failed on ${FILE}, aborting." && break
# Use cp -c to use APFS clonefile(2)
# Use cp -a to preserve metadata, recurse, and not follow symlinks
${PCP} -ca "${SOURCEFILE}" "${FILE}";
# Test that copy was successful (protect against e.g. empty $PCP string)
test ! -e "${FILE}" && echo "Error: copy failed on ${FILE}, aborting." && break
# Copy over attributes
if [[ "${ATTRUSEGCP}" -eq 1 ]]; then
# Using gnu copy -- xattrs not preserved
# https://unix.stackexchange.com/a/93842
# https://unix.stackexchange.com/questions/402862/cp-losing-files-metadata#402869
# Poorer alternative: https://unix.stackexchange.com/questions/91080/maintain-or-restore-file-permissions-when-replacing-file
${PGCP} --preserve=all --attributes-only "${TEMPFILE}" "${FILE}"
fi
if [[ "${ATTRUSEXATTR}" -eq 1 ]]; then
# Using macOS native xattr, preserving xattrs up to 128K of data
# To properly preserve metadata use the Apple Standard C Library copyfile copyfile(..., COPYFILE_METADATA)
# via https://apple.stackexchange.com/questions/226485/copy-extended-attributes-to-new-file-ffmpeg
# and @worldpoop and @kapitainsky at https://gist.github.com/tvwerkhoven/9a30c6adc7f95a0278e895b5563b900b
set -e
IFS=$'\n' attr_names=($(xattr "${TEMPFILE}"))
for attr in $attr_names; do
value=$(xattr -p -x "${attr}" "${TEMPFILE}" | tr -d " \n")
xattr -w -x "${attr}" "${value}" "${FILE}"
done
fi
fi
done
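### Optional: sanity-check the result (a sketch, not part of the original flow)
# Remove the leftover metadata-preservation temp file from the final set
test -e "${TEMPFILE}" && rm "${TEMPFILE}"
# APFS clones share blocks with their source, so free space on the volume
# should have grown by roughly the duplicate size estimated above
/bin/df -k . | /usr/bin/awk 'NR==2 {print "free: "$4" kb"}'
# A byte-wise comparison of any de-duplicated pair should still succeed,
# since clonefile(2) copies are content-identical, e.g.:
#   cmp "some/file1" "some/file2" && echo "identical"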
## Using fdupes - bash (not tested)
# NB: --sameline output is space-separated, so this variant breaks on
# filenames containing spaces.
# Get matches
# https://unix.stackexchange.com/questions/34366/is-there-a-way-of-deleting-duplicates-more-refined-than-fdupes-rdn
# DUPEFILE=fdupes-20200101a
# fdupes --sameline --recurse ./ | tee ${DUPEFILE}
# cat ${DUPEFILE} | while read SOURCEFILE DESTFILES; do
# # Split lines by spaces
# # Source https://stackoverflow.com/a/30212526
# read -ra DESTFILESARR <<<${DESTFILES}
# for DEST in "${DESTFILESARR[@]}"; do
# mv "${DEST}" tmp
# echo cp -ca "${SOURCEFILE}" "${DEST}";
# echo gcp --preserve=all --attributes-only tmp "${DEST}"
# done
# done
@kapitainsky

this script does not preserve extended attributes. All deduplicated files' xattrs will be replaced with those from the first file.
It seems that gcp on macOS is built without xattr support.
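A quick way to verify this yourself (a sketch using a throwaway test attribute):

    touch a b
    xattr -w com.example.test hello a
    gcp --preserve=all --attributes-only a b
    xattr -l b   # prints nothing if gcp lacks xattr support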

Oh wow, @kapitainsky, you're right. On macOS Monterey, gcp --preserve=xattr --attributes-only doesn't do a thing for me either.
This guy has a tidy way to use the native xattr tool to copy attributes. (He cautions that any single attr with more than 128K of data will fail. But that seems like a ton of data, unlikely to be exceeded, right... unless images or such can be encoded into xattrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!

Good catch, thanks. I added it, including a notice that metadata preservation is not 100% reliable. Also, the xattr tool seems quite slow (on my system), so I added the option to use either gcp or xattr, in spite of the issues you mentioned @worldpoop ;) Using the Apple Standard C library is a bit beyond the scope of what I intended here @kapitainsky, but perhaps this helps somebody else.

Also, I'd encourage people to contribute a pull request to jdupes so this all converges nicely in a more mature project :)

Looking for the best deduplicator for my needs, for now I stick with https://github.com/pkolaczk/fclones - thanks to multithreading it is 10x faster than jdupes on an SSD (which any Mac has, unless it's ancient), it does not have issues with in-place compression data corruption, and it had problems with xattrs but that was fixed this weekend. Nothing is perfect though: it does not do byte-wise file comparison but uses hashes. Good that there are many tools around, so you can choose the right one for the job.
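For anyone trying fclones: if I recall its CLI correctly, de-duplicating via clones boils down to something like:

    fclones group . > dupes.txt
    fclones dedupe < dupes.txt

(dedupe replaces duplicates with reflinks/clones on filesystems that support them, like APFS.)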
