@FiXato
Last active February 28, 2020 05:01
Get an overview of likely byte-for-byte duplicates in a zip file, using 7z, gawk and grep, based on CRC+filesize
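# 7z-list-duplicates.awk: group archive entries by CRC and size (needs gawk 4+ for arrays of arrays).
# Record one archive entry: name is its path, body holds the remaining "Key = Value" lines.
function process_group(name, body) {
    # Entries without a CRC (e.g. directories) cannot be compared; skip them,
    # otherwise gensub() would return the whole body as the "crc".
    if (body !~ /\nCRC = [A-F0-9]{8}\n/) return;
    size = gensub(/.+\nSize = ([0-9]+)\n.+/, "\\1", "G", body);
    crc = gensub(/.+\nCRC = ([A-F0-9]{8})\n.+/, "\\1", "G", body);
    packed = gensub(/.+\nPacked Size = ([0-9]+)\n.+/, "\\1", "G", body);
    # Modified is a timestamp, not a hex value, so capture the rest of the line.
    modified = gensub(/.+\nModified = ([^\n\r]+)\n.+/, "\\1", "G", body);
    # Files sharing both CRC and size are treated as likely duplicates.
    id = crc "-" size;
    uniques[id]++;
    data[id]["size"] = size;
    data[id]["crc"] = crc;
    data[id]["totalsize"] += size;
    data[id]["files"] = (data[id]["files"]==""?name:data[id]["files"]";"name);
    data[id][name]["modified"] = modified;
    data[id][name]["packed"] = packed;
}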
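BEGIN {
    printf("%-12s %-12s %6s %-20s %s\n",
        "Wasted",
        "Total",
        #"Size",
        "Count",
        "CRC-Size",
        "Files")
    # Split the input at every "Path = ..." line; gawk stores the matched separator in RT.
    RS="(^|\n*)Path = [^\n\r]+"
    PREV=""
}
{
    # $0 is the body that follows the previous "Path = ..." separator.
    if (PREV!="") {
        process_group(gensub(/^\n*Path = ([^\n\r]+)/, "\\1", 1, PREV), $0);
    }
    PREV=RT
}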
END {
    total_size = 0
    total_ideal = 0
    for (key in uniques) {
        total_size += data[key]["totalsize"];
        total_ideal += data[key]["size"];
        count = uniques[key];
        # Wasted = bytes used by all copies minus the bytes of a single copy.
        printf("%-12i %-12i %6i %-20s %s\n",
            (data[key]["totalsize"] - data[key]["size"]),
            data[key]["totalsize"],
            #data[key]["size"],
            count,
            key,
            data[key]["files"])
    }
    # Ideal = total size if every group kept one copy; Used = actual uncompressed total.
    printf("== %10s: %20i\n", "Ideal", total_ideal);
    printf("== %10s: %20i\n", "Used", total_size);
}
7z l -slt -ba "$1" | gawk -f ~/bin/7z-list-duplicates.awk | sort -n | grep -Ev '^0'