@yekm
Last active August 23, 2023 10:43
$ rmlint  -gmk librusec_new // flibusta_new 
...
==> In total 1335219 files, whereof 371965 are duplicates in 370764 groups.
==> This equals 486.93 GB of duplicates which could be removed.
==> 22 other suspicious item(s) found, which may vary in size.
==> Scanning took in total  8h 14m 7.375s.

###

$ du -sh flibusta_new/ librusec_new/
1.5T    flibusta_new/
1.7T    librusec_new/

###

[100%] Done!
Deleting script  ./rmlint.sh

real    20m23.865s
user    6m31.953s
sys     3m23.473s
$ du -sh librusec_new/
1.2T    librusec_new/

find . -iname '*.fb2' | parallel --progress --eta -n 2048 'cp --reflink=always -t _all/'
time find _all -iname '*.fb2' | parallel -j200% -k --progress --eta --tag 'enca -e -L ru' > enca.txt
time cat enca.txt | grep -v UTF-8 | cut -f1 | parallel -j800% -k --progress --eta 'enconv -L ru -x UTF-8 _all/{/}'
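
The filter in the last pipeline can be tried on a small sample. This is only a sketch: the three paths and encodings below are invented, and it assumes each line of enca.txt has the form "<path><TAB><encoding>", which is what `parallel --tag` writes in front of the `enca -e` output.

```shell
# Hypothetical sample of enca.txt (paths and encodings are made up).
sample=$(printf '%s\t%s\n' \
    '_all/a.fb2' 'UTF-8' \
    '_all/b.fb2' 'CP1251' \
    '_all/c.fb2' 'KOI8-R')

# Drop every line that mentions UTF-8, then keep only the path column.
to_convert=$(printf '%s\n' "$sample" | grep -v 'UTF-8' | cut -f1)

# These are the files that would be handed to enconv.
printf '%s\n' "$to_convert"
```

Note that `grep -v UTF-8` matches anywhere in the line, so a file whose name happens to contain "UTF-8" would be skipped as well; comparing only the second field would be stricter.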


cat */info.txt | cut -f2- -d'|' | sort | uniq -d | tee index.uniq-d.txt
cat index.uniq-d.txt | head | parallel -k 'grep -h -F {} flibusta_new/info.txt librusec_new/info.txt | cut -f1 -d\| | ( read f; read l; diff -u0 flibusta_new/$f librusec_new/$l)' |less
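
A minimal sketch of what the two index commands above do, on invented data. It assumes each info.txt line looks like "<file>|<metadata...>", which is inferred from the `cut` calls; the file names, authors, and titles are hypothetical.

```shell
# Hypothetical info.txt contents for the two collections.
flibusta=$(printf '%s\n' '100.fb2|Author A|Title One' '101.fb2|Author B|Title Two')
librusec=$(printf '%s\n' '900.fb2|Author A|Title One' '901.fb2|Author C|Title Three')

# Drop the file-name field, then keep metadata lines that occur more than once.
dups=$(printf '%s\n%s\n' "$flibusta" "$librusec" | cut -f2- -d'|' | sort | uniq -d)

# Map the duplicated metadata back to the file name in each index.
f=$(printf '%s\n' "$flibusta" | grep -F "$dups" | cut -f1 -d'|')
l=$(printf '%s\n' "$librusec" | grep -F "$dups" | cut -f1 -d'|')

# The pair of files that diff -u0 would then compare.
printf '%s: flibusta_new/%s librusec_new/%s\n' "$dups" "$f" "$l"
```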