@tvwerkhoven
Last active March 14, 2024 17:38
De-duplicate using APFS clonefile(2) and jdupes in zsh
#!/usr/bin/env zsh
#
# # About
#
# Since APFS supports block-level de-duplication (clones), it can be useful to
# manually de-duplicate your files if you've migrated/upgraded to APFS rather
# than starting from a fresh install.
#
# I've written this simple script with the aim to:
# - Be simple, easy to read and understand (for users to check)
# - Use native cp -c for de-duplication (for robustness)
# - Use byte-wise file comparison instead of hashing (while rare, hash collisions are possible)
# - Use jdupes for speed
# - Preserve file metadata via GNU cp
#
# See also this stackexchange thread: https://apple.stackexchange.com/questions/316435/replace-existing-duplicate-files-on-apfs-with-clones
#
# # Known bugs
#
# - Does not preserve target directory timestamps
# - Does not preserve xattrs if larger than 128kb (https://apple.stackexchange.com/questions/226485/copy-extended-attributes-to-new-file-ffmpeg)
#
# # Background info
#
# https://developer.apple.com/documentation/foundation/file_system/about_apple_file_system
# https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf
# https://eclecticlight.co/2019/01/05/aliases-hard-links-symlinks-and-copies-in-mojaves-apfs/
# https://eclecticlight.co/2017/11/02/taking-stock-using-apfs-in-high-sierra-10-13-1/
#
# # Alternatives (https://apple.stackexchange.com/questions/316435/replace-existing-duplicate-files-on-apfs-with-clones)
#
# Python, uses hashes (collision risk): https://github.com/ranvel/clonefile-dedup
# Python, uses hashes (collision risk, does not preserve metadata?): https://bitbucket.org/dchevell/apfs-deduplicate/src/master/
# C, checks for duplication, does not de-duplicate: https://github.com/dyorgio/apfs-clone-checker
# Does not preserve metadata: https://github.com/deckarep/apfs-compactor
# Paid: http://diskdedupe.com/
# Paid: https://macpaw.com/gemini
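#
# # Usage (example)
#
# Run this script from the directory you want to de-duplicate (it scans ./ and
# writes its working files there), e.g. assuming it is saved as dedupe.sh and
# you have write permission in that directory:
#
#   cd /path/to/tree && /path/to/dedupe.sh
#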
### Init: identify files and programs
# File to hold duplicate file data
DUPEFILE=./jdupes-output
# File to temporarily store old file for metadata
TEMPFILE=./tmp-preserved-for-metadata
# Select which method(s) to use for xattr/metadata preservation: gcp preserves
# most metadata (but not xattrs, see below), xattr preserves extended attributes
# up to 128 kB each but is significantly slower. Enable either or both.
ATTRUSEGCP=1
ATTRUSEXATTR=1
# Critical programs to use
PCP=/bin/cp # Should be Mac native cp supporting clonefile(2)!
PMV=/bin/mv
PGCP=/opt/local/bin/gcp # Should be GNU cp, installed via MacPorts -- not to be confused with an alias for git cherry-pick
PJDUPES=/opt/local/bin/jdupes
test ! -x "${PCP}" && echo "Error: cp not found or not executable at ${PCP}" && exit 1
test ! -x "${PMV}" && echo "Error: mv not found or not executable at ${PMV}" && exit 1
test ! -x "${PGCP}" && echo "Error: GNU cp not found or not executable at ${PGCP}" && exit 1
test ! -x "${PJDUPES}" && echo "Error: jdupes not found or not executable at ${PJDUPES}" && exit 1
### Optional: check how much data can be saved
${PJDUPES} --recurse --omitfirst ./ | tee ${DUPEFILE}
# Loop over lines, if line is not empty, check size, sum in awk
cat ${DUPEFILE} | while read thisfile; do
    test ! -z "$thisfile" && du -k "$thisfile"
done | awk '{i+=$1} END {print i" kb"}'
### Find duplicates
# Find duplicates, use NUL character to separate to allow for newlines in
# filenames (rare but possible).
${PJDUPES} --print-null --recurse ./ > ${DUPEFILE}
# Check the number of sets of duplicates by counting occurrences of two
# consecutive NUL characters.
# Counting trick via https://stackoverflow.com/questions/371115/count-all-occurrences-of-a-string-in-lots-of-files-with-grep
NPAIRS=$(grep -oaE '\x00\x00' ${DUPEFILE} | wc -l)
echo "Found ${NPAIRS} sets of duplicates"
### Start de-duplication
# Loop over files separated by NUL characters, use first file of paired
# filenames as source for all other files in this set, e.g.
#
# file1\x00
# file2\x00
# file3\x00\x00
#
# will cause file2 and file3 to be replaced by clones of file1
#
# - If the entry is empty (i.e. we hit two consecutive NULs), a new set begins
#   and we unset SOURCEFILE. This also holds before the first set, as SOURCEFILE starts unset
# - If SOURCEFILE is unset, use the current file to set this
# - If the file is not empty AND SOURCEFILE is set, make a copy:
# -- Move the target file to a new temporary location
# -- Clone the source file over the target file
# -- Copy attributes from source file to target file
SOURCEFILE=""
cat ${DUPEFILE} | while read -d $'\0' FILE; do
if [[ -z $FILE ]]; then
SOURCEFILE=""
elif [[ -z $SOURCEFILE ]]; then
SOURCEFILE=${FILE}
else
# Presever original file for metadata
${PMV} "${FILE}" "${TEMPFILE}";
# Test that move was successful
test ! -e "${TEMPFILE}" && echo "Error: move failed on ${FILE}, aborting." && break
# Use cp -c to use APFS clonefile(2)
# Use cp -a to preserve metadata, recurse, and not follow symlinks
${PCP} -ca "${SOURCEFILE}" "${FILE}";
# Test that copy was successful (protect against e.g. empty $PCP string)
test ! -e "${FILE}" && echo "Error: copy failed on ${FILE}, aborting." && break
# Copy over attributes
if [[ "${ATTRUSEGCP}" -eq 1 ]]; then
# Using gnu copy -- xattrs not preserved
# https://unix.stackexchange.com/a/93842
# https://unix.stackexchange.com/questions/402862/cp-losing-files-metadata#402869
# Poorer alternative: https://unix.stackexchange.com/questions/91080/maintain-or-restore-file-permissions-when-replacing-file
${PGCP} --preserve=all --attributes-only "${TEMPFILE}" "${FILE}"
fi
if [[ "${ATTRUSEXATTR}" -eq 1 ]]; then
# Using macOS native xattr, preserving xattrs up to 128K of data
# To properly preserve metadata use the Apple Standard C Library copyfile copyfile(..., COPYFILE_METADATA)
# via https://apple.stackexchange.com/questions/226485/copy-extended-attributes-to-new-file-ffmpeg
# and @worldpoop and @kapitainsky at https://gist.github.com/tvwerkhoven/9a30c6adc7f95a0278e895b5563b900b
set -e
IFS=$'\n' attr_names=($(xattr "${TEMPFILE}"))
for attr in $attr_names; do
value=$(xattr -p -x "${attr}" "${TEMPFILE}" | tr -d " \n")
xattr -w -x "${attr}" "${value}" "${FILE}"
done
fi
fi
done
### Using fdupes - bash (not tested)
# Get matches
# https://unix.stackexchange.com/questions/34366/is-there-a-way-of-deleting-duplicates-more-refined-than-fdupes-rdn
# DUPEFILE=fdupes-20200101a
# fdupes --sameline --recurse ./ | tee ${DUPEFILE}
# cat ${DUPEFILE} | while read SOURCEFILE DESTFILES; do
#     # Split lines by spaces
#     # Source https://stackoverflow.com/a/30212526
#     read -ra DESTFILESARR <<<${DESTFILES}
#     for DEST in "${DESTFILESARR[@]}"; do
#         mv "${DEST}" tmp
#         echo cp -ca "${SOURCEFILE}" "${DEST}";
#         echo gcp --preserve=all --attributes-only tmp "${DEST}"
#     done
# done
@jbruchon

jdupes now supports APFS clonefile() natively using the same -B/--dedupe option as BTRFS/XFS dedupe on Linux, introduced in commit https://github.com/jbruchon/jdupes/commit/c56ebd06df0d78ef79ee7b4e9be4d54651145811 and available from jdupes v1.17.1 onwards.
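
For reference, a minimal invocation using that option might look like this (my example, not from the thread; jdupes >= 1.17.1 on APFS):

jdupes --recurse --dedupe ./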

@adib commented Apr 4, 2022

However, jdupes 1.20.2 can cause data corruption if any of the de-duplicated files are compressed by the file system.

@worldpoop

Ack, @adib, that's pretty serious. I assume that the use of cp in this script circumvents this jdupes defect?

@jbruchon

this jdupes defect?

It's not a jdupes defect, it's a macOS defect.

@worldpoop commented Aug 13, 2022

Apologies. Mac's lovely defect. This does, then, render jdupes hazardous to use on a Mac for APFS de-duping?

@jbruchon

@worldpoop No, this defect only affects files that are compressed using APFS transparent compression, and that's not something most people will encounter normally.
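
(An aside, not from the thread: one quick way to check whether a given file uses transparent compression is to list its file flags; on macOS, ls with -O shows a "compressed" flag for such files.)

ls -lO somefile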

@worldpoop commented Aug 13, 2022

There are user-friendly utilities in the app store, and various how-tos on the interwebs, to facilitate transparent compression to help users save disk space -- meaning it's not that rare, especially among users already keen on CLI tools. And this means, owing to the macOS defect, we have this circumstance in the wild where jdupes can do irreparable damage to a jdupes user. If nothing else, a check or a warning would be really good! .02. Thank you for the tool -- be well! - b

@adib commented Aug 13, 2022

Using cp -c should avoid the problem from using jdupes to de-duplicate files on APFS. It’s better if the tool can spare the user from installing further 3rd-party tools, jdupes included. Maybe use sha to find duplicate files?
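
For what it's worth, a rough sketch of that idea using only macOS built-ins (my illustration, untested; being hash-based it reintroduces the collision risk the script deliberately avoids, and it breaks on filenames containing newlines):

# Hash every file, then print every entry whose hash was already seen,
# i.e. the candidate duplicates (analogous to jdupes --omitfirst)
find . -type f -print0 | xargs -0 shasum -a 256 | awk 'seen[$1]++'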

@jbruchon

There was code in an issue report that was supposed to solve the problem within jdupes. It was never sent in as a pull request.

@worldpoop

rpendleton's commits, retain flag; check flag?

@worldpoop commented Aug 15, 2022

Script is perfect (save for the time-stamp issue), am using it, thank you!! Notes: 1. Requires path adjustments for Monterey. 2. test -x failed verification, had to use test -e on my system. (Monterey again?) 3. Seems the jdupes option should be -print-null, not -printnull. 4. Permission denied with tee (?), had to sudo run the script, otherwise an inadvertent "dry run".

dyorgio/apfs-clone-checker is pretty good. It doesn't hash, just looks for initial shared blocks, so it's very fast. dedupe.sh runs cp -c even on already deduped files -- a check with clone-checker would avert that (possibly with a few false positives, thanks to no hash, but no real harm).

Minutiae: 1. A test confirmed the third-party tool would have ravaged my broadly compressed APFS volume! -- glad to have hit upon the caution and then upon your script, which marries well with jdupes. Indeed, before pulling the trigger I'd been testing with newly created junk files, which of course are not compressed when created, so the danger wasn't revealed. 2. @adib good note on 3rd-party tools with this kind of operation -- sha would be slower? 3. Not a coder, forgive anything stupid I've said. Cheers - b

@tvwerkhoven (Author)

Hi all, thanks for your comments. I indeed opted for Mac's native cp -c because I hoped that would work better/more safely, especially for something as potentially harmful as de-duplication.

@worldpoop:

  1. Which paths do you use? I get gcp and jdupes via MacPorts, which explains the /opt/local/
  2. That's strange, test -x tests for executability, so if that fails you cannot execute the programs?
  3. Oops, fixed!
  4. Indeed, you need write permissions in the working directory, else you cannot log the dupe files or de-duplicate.

On your minutia:

  1. Good to hear :)
  2. I disagree, writing a script to check for duplication would largely duplicate jdupes.
  3. No worries, thanks for investigating/testing

@worldpoop

Thanks, @tvwerkhoven

Is jdupes checking for identical files or for actual shared blocks? Testing things I recall dedupe.sh executing the cp -c on the first of any found duplicate, whether or not that dupe was already a clone. I may have been careless & will check again tonight. But if that's the case, seems like a good idea not to re-clone any files.

Yup, those files are executable. I have no explanation. On path, macports, got it -- I brewed.

Thanks again!

@jbruchon

Is jdupes checking for identical files or for actual shared blocks?

All operations are done by reading the file. There is no filesystem-specific magic going on except for the call to clonefile() at the end of the process. Otherwise, files are treated as discrete entities by jdupes.

@worldpoop commented Aug 17, 2022

One oddity I see is that the script will dedup already deduped files. Even though a file is a clone set sharing one location, dedup.sh will run the copy operation (reapply attributes through the tempfile plus change the target's timestamp again). Run it ten times over a season on a tree of five hundred duplicates, and it will "re-"cp -c every set every time. Is there a reliable way to filter out already deduped files? The following seems to work in my setup (with a few false negatives, I can live with that), and I'll use it, but of course it means adding another executable to the prerequisites. Call it peace of mind of not over-mucking with files that APFS already holds as cloned through whatever historical means....

SOURCEFILE=""
cat ${DUPEFILE} | while read -d $'\0' FILE; do
     if  [[ -z $FILE ]]; then
         SOURCEFILE=""
     elif [[ -z $SOURCEFILE ]]; then
         SOURCEFILE=${FILE}
     else
         # github.com/dyorgio/apfs-clone-checker
         CLONECHECKED="$($CLNCHKR -qf ${SOURCEFILE} ${FILE})"
         if [[ ${CLONECHECKED} == "1" ]]; then
             echo "Skipping ${SOURCEFILE} -> ${FILE} -- already deduped / clones."
         else
             echo "Deduping ${SOURCEFILE} -> ${FILE}"
             # Preseve original file for metadata
             ${PMV} "${FILE}" "${TEMPFILE}";

(@tvwerkhoven)

@tvwerkhoven (Author)

Hi @worldpoop, that would be possible, but not sure what problem that solves, can you elaborate?

As I see it, 1) this would add another dependency, and 2) there's no harm in copying cloned files.

@kapitainsky commented Sep 1, 2022

This script does not preserve extended attributes: all deduplicated files' xattrs will be replaced with those from the first file.

Seems that gcp on macOS is built without xattr support:

# gcp --preserve=xattr --attributes-only sourceFile destFile
gcp: cannot preserve extended attributes, cp is built without xattr support

As a result, for three identical files but with different xattrs:

# ls
hello.txt  hello1.txt  hello2.txt

# shasum *
750c7735502f7c6072d8b4c9239697302d393480  hello.txt
750c7735502f7c6072d8b4c9239697302d393480  hello1.txt
750c7735502f7c6072d8b4c9239697302d393480  hello2.txt

# xattr -l *
hello.txt: test:
hello1.txt: test2:

# dedupe.sh
Scanning: 4 files, 1 items (in 1 specified)
./hello1.txt
./hello2.txt

3904 kb
Scanning: 4 files, 1 items (in 1 specified)
Found        1 sets of duplicates

# xattr -l *
hello.txt: test:
hello1.txt: test:
hello2.txt: test:

@worldpoop

This script does not preserve extended attributes: all deduplicated files' xattrs will be replaced with those from the first file.
Seems that gcp on macOS is built without xattr support:

Oh wow, @kapitainsky, you're right. macOS Monterey -- gcp --preserve=xattr --attributes-only doesn't do a thing for me either.

This guy has a tidy way to use native xattr to copy attributes. (Cautions that any single attr with more than 128K of data will fail. But that actually seems like a ton of data unlikely to be exceeded, right... unless images or such can and are encoded into x-attrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!

#!/usr/bin/env zsh

if [[ $# -ne 2 ]]; then
    print >&2 "usage: copy_xattrs SOURCE_FILE TARGET_FILE"
    exit 1
fi
set -e
IFS=$'\n' attr_names=($(xattr "$1"))
for attr in $attr_names; do
    value=$(xattr -p -x "$attr" "$1" | tr -d " \n")
    xattr -w -x "$attr" "$value" "$2"
done

@worldpoop commented Sep 1, 2022

@tvwerkhoven

not sure what problem that solves, can you elaborate?

I dunno, maybe you're right. Just felt like good form not to do unnecessary file system operations, especially if one is processing a large number of files. Running your inspired script on my end, it took a good bit of time to do a large number of files on the first run. (I have it recursing.) And yes, clone-checking I'm sure adds to that time, though jdupes still seems to be the lengthiest portion, sensibly.

Of note, I changed it to copy rather than move the target to the tempfile, because if you hit an error/break, you gotta stop, find the tempfile or original source and restore the blitzed target manually. It's cp -c to a temp on the same volume -- that way, especially for large video files (I have ProRes files up to 100GB), it's an instant copy and no extra space is used anywhere by the tempfile -- then rm the tempfile at the end of each loop.
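
In case it helps anyone, roughly what that change looks like in the loop body (my sketch of the above, untested; if cp -c refuses to clone over the still-existing target, remove the target after the backup clone succeeds and before re-cloning):

${PCP} -c "${FILE}" "${TEMPFILE}"   # was: ${PMV} "${FILE}" "${TEMPFILE}" -- instant, space-free APFS clone as backup
# ... the existing clone + metadata-restore steps stay unchanged ...
rm "${TEMPFILE}"                    # drop the backup at the end of each iteration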

One small lump of havoc I hit is that gcp cannot handle any file names with "\" in them. cp, mv, and jdupes too for that matter, are fine, but I could find no way to get gcp to not always read "\" as an escape no matter how I wrapped it. (I have a folder tree from an assistant editor full of audio stems all prefixed with "\" ("1\6", "3\12", etc., meaning this file is first of six, that one third of twelve tracks, and so on.) So, well, now on two counts gcp is to the curb!

@kapitainsky

This guy has a tidy way to use native xattr to copy attributes. (Cautions that any single attr with more than 128K of data will fail. But that actually seems like a ton of data unlikely to be exceeded, right... unless images or such can and are encoded into x-attrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!

As it is clear from the start that it does not always work, this is not really a solution....

I did a bit of poking around and it seems the only way to do this properly is to use the Apple Standard C Library copyfile(..., COPYFILE_METADATA).

@worldpoop

You mean xattr -x doesn't always work on Mac? Good poking on standard library.

@tvwerkhoven (Author)

This script does not preserve extended attributes: all deduplicated files' xattrs will be replaced with those from the first file.
Seems that gcp on macOS is built without xattr support:

Oh wow, @kapitainsky, you're right. macOS Monterey -- gcp --preserve=xattr --attributes-only doesn't do a thing for me either.

This guy has a tidy way to use native xattr to copy attributes. (Cautions that any single attr with more than 128K of data will fail. But that actually seems like a ton of data unlikely to be exceeded, right... unless images or such can and are encoded into x-attrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!

Good catch, thanks. I added it, including the notice that metadata preservation is not 100% reliable. Also, xattr seems quite slow (on my system), so I added the option to use either gcp or xattr, in spite of the issues you mentioned @worldpoop ;) Using the Apple Standard C lib is a bit beyond the scope of what I intended here @kapitainsky, but perhaps this helps somebody else.

Also, I'd encourage people to contribute a pull request to jdupes so this all converges nicely in a more mature project :)

@kapitainsky

Yep - this is why I shared it here - it seems the solution is beyond simple scripting, but maybe somebody will move it further.

@kapitainsky

Looking for the best deduplicator for my needs, for now I stick with https://github.com/pkolaczk/fclones - thanks to multithreading it is 10x faster than jdupes on an SSD disk (which is any Mac unless it's ancient), does not have issues with in-place compression data corruption, and had problems with xattrs, but these were fixed this weekend. Nothing is perfect though: it does not do byte-wise file comparison but uses hashes. Good that there are many tools around to choose the right one for the job.
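
For reference, a typical fclones run might look like the following (my example; check the fclones documentation for the exact subcommands and flags of your version):

fclones group . > dupes.txt     # find groups of identical files under the current directory
fclones dedupe < dupes.txt      # replace duplicates with clones/reflinks where the filesystem supports it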
