@majorgreys
Last active April 6, 2022 18:55
mbsync+mu cleanup

In moving my email from Gmail to Outlook, I seem to have ended up with multiple copies of some emails. I am still struggling to figure out how many there are and how to remove them. The problem seems to be that the copies share the same message ID but have different X-TUID headers.

$ mu find msgid:CY1PR15MB0155CB9AAD45DA010FFFC2FCF15E0@CY1PR15MB0155.namprd15.prod.outlook.com -f 'l'
/home/tbutt/.mail/outlook/Sent/cur/1522002651.16857_33.knuckles,U=36:2,S
/home/tbutt/.mail/outlook/Archive/cur/1522002326.15658_3659.knuckles,U=236990:2,S
/home/tbutt/.mail/outlook/Archive/cur/1521978114.18576_998.knuckles,U=2593:2,S
/home/tbutt/.mail/gc/Archive/cur/1522211460.25821_33.knuckles,U=34:2,S
/home/tbutt/.mail/outlook/Archive/cur/1521979684.21093_9183.knuckles,U=20859:2,S

$ diff /home/tbutt/.mail/outlook/Archive/cur/1522002326.15658_3659.knuckles,U=236990:2,S /home/tbutt/.mail/outlook/Archive/cur/1521978114.18576_998.knuckles,U=2593:2,S
< X-TUID: AR/2lqM1OiYM
24a24
> X-TUID: Brt0VWOJb91f

$ md5sum /home/tbutt/.mail/outlook/Archive/cur/1522002326.15658_3659.knuckles,U=236990:2,S /home/tbutt/.mail/outlook/Archive/cur/1521978114.18576_998.knuckles,U=2593:2,S
ceafb53ef363ecde8fe77d270c7bef13  /home/tbutt/.mail/outlook/Archive/cur/1522002326.15658_3659.knuckles,U=236990:2,S
7caa6b1f87405f945966b1916b5eaf07  /home/tbutt/.mail/outlook/Archive/cur/1521978114.18576_998.knuckles,U=2593:2,S

Searching for X-TUID brings up a thread on mu-discuss about this specific issue. The solution is to wrap the md5sum call that find-dups.scm makes, so that the X-TUID header is removed before the checksum is computed. Running the modified script, I find there are 16040 messages with duplicate md5sums in 16236 files (meaning a few have more than one duplicate).
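To see why removing the header helps, here is a minimal sketch with two made-up messages that differ only in their X-TUID line. The addresses, header values, and the helper names `make_msg` and `normalized_md5` are placeholders for illustration, and it uses sed instead of the perl wrapper used with the script.

```shell
#!/bin/sh
# Sketch: two hypothetical messages, identical except for X-TUID.
tmp=$(mktemp -d)

# make_msg TUID FILE: write a tiny made-up message to FILE.
make_msg() {
    printf 'From: a@example.com\nMessage-ID: <1@example.com>\nX-TUID: %s\nSubject: hi\n\nbody\n' "$1" > "$2"
}
make_msg AAAAAAAAAAAA "$tmp/copy1"
make_msg BBBBBBBBBBBB "$tmp/copy2"

# normalized_md5 FILE: checksum of FILE with any X-TUID line removed first.
normalized_md5() {
    sed '/^X-TUID:/d' "$1" | md5sum | cut -d ' ' -f 1
}

raw1=$(md5sum < "$tmp/copy1" | cut -d ' ' -f 1)
raw2=$(md5sum < "$tmp/copy2" | cut -d ' ' -f 1)
norm1=$(normalized_md5 "$tmp/copy1")
norm2=$(normalized_md5 "$tmp/copy2")

if [ "$raw1" != "$raw2" ]; then echo "raw checksums differ"; fi
if [ "$norm1" = "$norm2" ]; then echo "normalized checksums match"; fi

rm -rf "$tmp"
```

The raw checksums differ for the same reason as in the md5sum output above, while the normalized ones agree, which is what lets a checksum comparison recognise the copies as duplicates.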

$ mu index -m ~/.mail/
indexing messages under /home/tbutt/.mail [/home/tbutt/.mu/xapian]
- processing mail; processed: 237225; updated/new: 0, cleaned-up: 0
cleaning up messages [/home/tbutt/.mu/xapian]
- processing mail; processed: 245400; updated/new: 0, cleaned-up: 15617

The modified find-dups.scm:

#!/bin/sh
exec guile2.0 -e main -s "$0" "$@"
!#
;; INFO: find duplicate messages
;; INFO: options:
;; INFO: --muhome=<muhome>: path to mu home dir
;; INFO: --delete: delete all but the first one (experimental, be careful!)
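(use-modules (mu) (mu script) (mu stats))
(use-modules (ice-9 getopt-long) (ice-9 optargs)
             (ice-9 popen) (ice-9 format) (ice-9 rdelim)
             (ice-9 pretty-print))

;; Checksum PATH through the ./thb-md5sum wrapper instead of plain md5sum.
;; The path is relative, so run this script from the directory that
;; contains thb-md5sum.
(define (md5sum path)
  (let* ((port (open-pipe* OPEN_READ "./thb-md5sum" path))
         (md5 (read-delimited " " port)))
    (close-pipe port)
    md5))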

(define (find-dups delete expr)
  (let ((id-table (make-hash-table 20000)))
    ;; fill the hash with <msgid-size> => <list of paths>
    (mu:for-each-message
     (lambda (msg)
       (let* ((id (format #f "~a-~d" (mu:message-id msg)
                          (mu:size msg)))
              (lst (hash-ref id-table id)))
         (if lst
             (set! lst (cons (mu:path msg) lst))
             (set! lst (list (mu:path msg))))
         (hash-set! id-table id lst)))
     expr)
    ;; list all the paths with multiple elements; check the md5sum to
    ;; make 100%-minus-ε sure they are really the same file.
    (hash-for-each
     (lambda (id paths)
       (if (> (length paths) 1)
           (let ((hash (make-hash-table 10)))
             (for-each
              (lambda (path)
                (when (file-exists? path)
                  (let* ((md5 (md5sum path))
                         (lst (hash-ref hash md5)))
                    (if lst
                        (set! lst (cons path lst))
                        (set! lst (list path)))
                    (hash-set! hash md5 lst))))
              paths)
             ;; (display (hash-count (const #t) hash))
             ;; hash now maps the md5sum to the messages...
             (hash-for-each
              (lambda (md5 mpaths)
                (if (> (length mpaths) 1)
                    (begin
                      (format #t "md5sum: ~a:\n" md5)
                      (let ((num 1))
                        (for-each
                         (lambda (path)
                           (if (equal? num 1)
                               (format #t "~a\n" path)
                               (begin
                                 (format #t "~a: ~a\n" (if delete "deleting" "dup") path)
                                 (if delete (delete-file path))))
                           (set! num (+ 1 num)))
                         mpaths)))))
              hash))))
     id-table)))

(define (main args)
  "Find duplicate messages and, potentially, delete the dups.
Be careful with that!
Interpret argument-list ARGS (like command-line
arguments). Possible arguments are:
--muhome (path to alternative mu home directory).
--delete (delete all but the first one). Run mu index afterwards.
--expr (expression to constrain search)."
  (setlocale LC_ALL "")
  (let* ((optionspec '( (muhome (value #t))
                        (delete (value #f))
                        (expr (value #t))
                        (help (single-char #\h) (value #f))))
         (options (getopt-long args optionspec))
         (help (option-ref options 'help #f))
         (delete (option-ref options 'delete #f))
         (expr (option-ref options 'expr #t))
         (muhome (option-ref options 'muhome #f)))
    (mu:initialize muhome)
    (find-dups delete expr)))
;; Local Variables:
;; mode: scheme
;; End:
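
For readers who do not use Guile, the second stage of the script above (group candidate files by checksum and report every group with more than one member) can be approximated in plain shell. This is a rough dry-run sketch, not a replacement for the script: it skips the message-ID and size pre-filter that mu provides, the function name `list_dups` is made up, it assumes file names contain no newlines, and it keeps the first file in sorted path order, which may differ from the order the script uses.

```shell
#!/bin/sh
# list_dups DIR: print "dup: PATH" for every file under DIR whose checksum
# (computed with X-TUID lines removed) was already seen. Deletes nothing.
list_dups() {
    find "$1" -type f | sort | while IFS= read -r path; do
        sum=$(sed '/^X-TUID:/d' "$path" | md5sum | cut -d ' ' -f 1)
        printf '%s %s\n' "$sum" "$path"
    done | awk 'seen[$1]++ { sub(/^[^ ]+ /, ""); print "dup: " $0 }'
}

# Demo on three made-up files: a and b differ only in X-TUID, c is distinct.
demo=$(mktemp -d)
printf 'X-TUID: one\nSubject: x\n\nbody\n' > "$demo/a"
printf 'X-TUID: two\nSubject: x\n\nbody\n' > "$demo/b"
printf 'X-TUID: three\nSubject: y\n\nbody\n' > "$demo/c"
list_dups "$demo"
rm -rf "$demo"
```

In the demo only `b` is reported, since `a` sorts first and is treated as the copy to keep.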

thb-md5sum:

#!/usr/bin/env sh
# Print the md5sum of the message file $1, ignoring any X-TUID line.
perl -ne 'print unless /^X-TUID:.*/' "$1" | md5sum
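
One caveat with the wrapper is that it drops every line starting with X-TUID:, including any that happen to appear in a message body. A possible variant, sketched here with awk, only filters the header section, meaning everything before the first empty line. The function name `strip_tuid` is made up, and the sketch assumes LF line endings, so a CRLF message would not have its header/body boundary detected.

```shell
#!/bin/sh
# strip_tuid FILE: print FILE without X-TUID header lines.
# Everything from the first empty line onward (the body) passes through as-is.
strip_tuid() {
    awk 'in_body    { print; next }
         /^$/       { in_body = 1; print; next }
         /^X-TUID:/ { next }
                    { print }' "$1"
}

# Used like the wrapper, when called with a file argument:
if [ -n "$1" ]; then
    strip_tuid "$1" | md5sum
fi
```

For files whose bodies contain no such line, this yields the same checksum as the perl version, so it only matters in that edge case.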