@pkern
Forked from fwenzel/cleanup-maildir.py
Created September 16, 2012 00:24
A script for cleaning up mails in Maildir folders, with proper threading support
#!/bin/sh
BASE=$HOME/Maildir
ARCHIVEBASE=$HOME/Maildir/archive.
for folder in $(find "$BASE" -maxdepth 1 -type d \! -regex '.*/archive\..*' \! -name cur \! -name tmp \! -name new)
do
    folder=$(basename "$folder")
    if [ "${folder}" = "Maildir" ]; then folder=INBOX; fi
    ./cleanup-maildir.py --archive-folder="${ARCHIVEBASE}${folder}" --maildir-root="$BASE" --folder-prefix= --age=365 -d 1 -k -u -v archive "${folder}"
done
#!/usr/bin/python -tt
# vim:set et ts=4 sw=4 ai:
"""
USAGE
  cleanup-maildir [OPTION].. COMMAND FOLDERNAME..

DESCRIPTION
  Cleans up old messages in FOLDERNAME; the exact action taken
  depends on COMMAND.  (See next section.)
  Note that FOLDERNAME is a name such as 'Drafts', and the
  corresponding maildir path is determined using the values of
  maildir-root, folder-prefix, and folder-seperator.

COMMANDS
  archive - move old messages to subfolders based on message date
  trash   - move old messages to trash folder
  delete  - permanently delete old messages

OPTIONS
  -h, --help
      Show this help.
  -q, --quiet
      Suppress normal output.
  -v, --verbose
      Output extra information for testing.
  -n, --trial-run
      Do not actually touch any files; just say what would be done.
  -a, --age=N
      Only touch messages older than N days.  Default is 14 days.
  -k, --keep-flagged-threads
      If any messages in a thread are flagged, do not touch them or
      any other messages in that thread.
  -u, --keep-unread-threads
      If any messages in a thread are unread, do not touch them or any
      other messages in that thread.
  -r, --keep-read
      If any messages are flagged as READ, do not touch them.
  -t, --trash-folder=F
      Use F as trash folder when COMMAND is 'trash'.
      Default is 'Trash'.
  --archive-folder=F
      Use F as the base for constructing archive folders.  For example, if F is
      'Archive', messages from 2004 might be put in the folder 'Archive.2004'.
  -d, --archive-hierarchy-depth=N
      Specify number of subfolders in archive hierarchy; 1 is just
      the year, 2 is year/month (default), 3 is year/month/day.
  --maildir-root=F
      Specifies folder that contains mail folders.
      Default is "$HOME/Maildir".
  --folder-seperator=str
      Folder hierarchy separator.  Default is '.'
  --folder-prefix=str
      Folder prefix.  Default is '.'

NOTES
  The following form is accepted for backwards compatibility, but is deprecated:
  cleanup-maildir --mode=COMMAND [OPTION].. FOLDERNAME..

EXAMPLES
  # Archive messages in 'Sent Items' folder over 30 days old
  cleanup-maildir --age=30 archive 'Sent Items'

  # Trash messages over 2 weeks old in 'Lists.debian-devel' folder,
  # except messages that are part of a thread containing a flagged message.
  cleanup-maildir --keep-flagged-threads trash 'Lists.debian-devel'
"""
__version__ = "0.2.3"
# $Id$
# $URL$
from pygraph.classes.graph import graph
from pygraph.algorithms.traversal import traversal
import email.Header
import getopt
import logging
import mailbox
import os
import os.path
import re
import rfc822
import socket
import string
import sys
import time
def mkMaildir(path):
    """Make a Maildir structure rooted at 'path'"""
    os.mkdir(path, 0700)
    os.mkdir(os.path.join(path, 'tmp'), 0700)
    os.mkdir(os.path.join(path, 'new'), 0700)
    os.mkdir(os.path.join(path, 'cur'), 0700)
class MaildirWriter(object):

    """Deliver messages into a Maildir"""

    path = None
    counter = 0

    def __init__(self, path=None):
        """Create a MaildirWriter that manages the Maildir at 'path'

        Arguments:
        path -- if specified, used as the default Maildir for this object
        """
        if path != None:
            if not os.path.isdir(path):
                raise ValueError, 'Path does not exist: %s' % path
            self.path = path
        self.logger = logging.getLogger('MaildirWriter')

    def deliver(self, msg, path=None):
        """Deliver a message to a Maildir

        Arguments:
        msg -- a message object
        path -- the path of the Maildir; if None, uses default from __init__
        """
        if path != None:
            self.path = path
        if self.path == None or not os.path.isdir(self.path):
            raise ValueError, 'Path does not exist'
        tryCount = 1
        srcFile = msg.fp._file.name
        (dstName, tmpFile, newFile, dstFile) = (None, None, None, None)
        while 1:
            try:
                dstName = "%d.%d_%d.%s" % (int(time.time()), os.getpid(),
                                           self.counter, socket.gethostname())
                tmpFile = os.path.join(os.path.join(self.path, "tmp"), dstName)
                newFile = os.path.join(os.path.join(self.path, "new"), dstName)
                self.logger.debug("deliver: attempt link %s to %s" %
                                  (srcFile, tmpFile))
                os.link(srcFile, tmpFile)  # Hard-link into tmp (no data is copied)
                self.logger.debug("deliver: attempt link to %s" % newFile)
                os.link(tmpFile, newFile)  # Link into new
            except OSError, (n, s):
                self.logger.critical(
                    "deliver failed: %s (src=%s tmp=%s new=%s i=%d)" %
                    (s, srcFile, tmpFile, newFile, tryCount))
                self.logger.info("sleeping")
                time.sleep(2)
                tryCount += 1
                self.counter += 1
                if tryCount > 10:
                    raise OSError("too many failed delivery attempts")
            else:
                break

        # Successful delivery; increment deliver counter
        self.counter += 1

        # For the rest of this method we are acting as an MUA, not an MDA.

        # Move message to cur and restore any flags
        dstFile = os.path.join(os.path.join(self.path, "cur"), dstName)
        if msg.getFlags() != None:
            dstFile += ':' + msg.getFlags()
        self.logger.debug("deliver: attempt link to %s" % dstFile)
        os.link(newFile, dstFile)
        os.unlink(newFile)

        # Cleanup tmp file
        os.unlink(tmpFile)
class MessageDateError(TypeError):
    """Indicate that the message date was invalid"""
    pass
class MaildirMessage(rfc822.Message):

    """An email message

    Has extra Maildir-specific attributes
    """

    def isFlagged(self):
        """return true if the message is flagged as important"""
        fname = self.fp._file.name
        if re.search(r':.*F', fname) != None:
            return True
        return False

    def getFlags(self):
        """return the flag part of the message's filename"""
        parts = self.fp._file.name.split(':')
        if len(parts) == 2:
            return parts[1]
        return None

    def isNew(self):
        """return true if the message is marked as unread"""
        # XXX should really be called isUnread
        fname = self.fp._file.name
        if re.search(r':.*S', fname) != None:
            return False
        return True

    def getSubject(self):
        """get the message's subject as a unicode string"""
        s = self.getheader("Subject")
        if s is None:
            # Callers treat None as "no subject"
            return None
        try:
            return u"".join(map(lambda x: x[0].decode(x[1] or 'ASCII', 'replace'),
                                email.Header.decode_header(s)))
        except LookupError:
            # Unknown charset: fall back to the raw header value
            return s

    def getSubjectHash(self):
        """get the message's subject in a "normalized" form

        This currently means lowercasing and removing any reply or forward
        indicators.
        """
        s = self.getSubject()
        if s == None:
            return '(no subject)'
        return re.sub(r'^(re|fwd?):\s*', '', string.strip(s.lower()))

    def getMessageId(self):
        return self.getheader('Message-ID')

    def getInReplyTo(self):
        irt = self.getheader('In-Reply-To')
        if irt is None:
            return None
        # Handle an empty In-Reply-To gracefully (RT does generate those).
        if len(irt.strip()) == 0:
            return None
        return irt

    def getReferences(self):
        references = self.getheader('References')
        if references is None:
            return []
        # split() drops empty strings, so mid[0] is always safe
        return [mid for mid in references.split()
                if mid[0] == '<' and mid[-1] == '>']

    def getDateSent(self):
        """Get the time of sending from the Date header

        Returns a time object using time.mktime.  Not very reliable, because
        the Date header can be missing or spoofed (and often is, by spammers).
        Returns None if the Date header is missing; raises a
        MessageDateError if it is invalid.
        """
        dh = self.getheader('Date')
        if dh == None:
            return None
        try:
            return time.mktime(rfc822.parsedate(dh))
        except ValueError:
            raise MessageDateError("message has missing or bad Date")
        except TypeError:  # gets thrown by mktime if parsedate returns None
            raise MessageDateError("message has missing or bad Date")
        except OverflowError:
            raise MessageDateError("message has missing or bad Date")

    def getDateRecd(self):
        """Get the time the message was received"""
        # Index 8 is st_mtime: seconds since the epoch, timezone-independent
        return os.stat(self.fp._file.name)[8]

    def getDateSentOrRecd(self):
        """Get the time the message was sent, fall back on time received"""
        try:
            d = self.getDateSent()
            if d != None:
                return d
        except MessageDateError:
            pass
        return self.getDateRecd()

    def getAge(self):
        """Get the number of days since the message was received"""
        msgTime = self.getDateRecd()
        # Both values are seconds since the epoch; mktime(gmtime()) would be
        # off by the local UTC offset.
        msgAge = time.time() - msgTime
        return msgAge / (60*60*24)
class MaildirCleaner(object):

    """Clean a maildir by deleting or moving old messages"""

    __trashWriter = None
    __mdWriter = None
    stats = {'total': 0, 'delete': 0, 'trash': 0, 'archive': 0}
    keepSubjects = {}
    archiveFolder = None
    archiveHierDepth = 2
    folderBase = None
    folderPrefix = "."
    folderSeperator = "."
    keepFlaggedThreads = False
    keepUnreadThreads = False
    trashFolder = "Trash"
    isTrialRun = False
    keepRead = False

    def __init__(self, folderBase=None):
        """Initialize the MaildirCleaner

        Arguments:
        folderBase -- the directory in which the folders are found
        """
        self.folderBase = folderBase
        self.__mdWriter = MaildirWriter()
        self.logger = logging.getLogger('MaildirCleaner')
        self.logger.setLevel(logging.DEBUG)

    def __getTrashWriter(self):
        if not self.__trashWriter:
            path = os.path.join(self.folderBase, self.folderPrefix + self.trashFolder)
            self.__trashWriter = MaildirWriter(path)
        return self.__trashWriter

    trashWriter = property(__getTrashWriter)

    def scanSubjects(self, folderName):
        """Scans for threads that contain flagged or unread messages"""
        self.logger.info("Scanning threads...")
        if (folderName == 'INBOX'):
            path = self.folderBase
        else:
            path = os.path.join(self.folderBase, self.folderPrefix + folderName)
        maildir = mailbox.Maildir(path, MaildirMessage)
        self.keepMsgIds = dict()
        wantedMsgIds = list()
        references = graph()
        for i, msg in enumerate(maildir):
            if i % 1000 == 0:
                self.logger.debug("Processed %d mails...", i)
            mid, irt = msg.getMessageId(), msg.getInReplyTo()
            if mid is None:
                self.logger.debug("Mail without a message ID found (%d): %s", i, msg.getSubjectHash())
                continue
            if not references.has_node(mid):
                references.add_node(mid)
            if not references.has_edge((mid, mid)):
                references.add_edge((mid, mid))
            if irt is not None:
                if not references.has_node(irt):
                    references.add_node(irt)
                if not references.has_edge((mid, irt)):
                    references.add_edge((mid, irt))
            # Add references header as well, as intermediate messages
            # might be saved in the Sent folder.
            for ref in msg.getReferences():
                if not references.has_node(ref):
                    references.add_node(ref)
                if not references.has_edge((mid, ref)):
                    references.add_edge((mid, ref))
            if self.keepFlaggedThreads and msg.isFlagged():
                wantedMsgIds.append(mid)
                self.logger.debug("Flagged (%d): %s -- %s", i, msg.getSubjectHash(), mid)
            if self.keepUnreadThreads and msg.isNew():
                wantedMsgIds.append(mid)
                self.logger.debug("Unread (%d): %s -- %s", i, msg.getSubjectHash(), mid)
        # Keep every message reachable from a flagged or unread one
        for wmid in wantedMsgIds:
            for tmid in traversal(references, wmid, 'pre'):
                self.keepMsgIds[tmid] = 1
                self.logger.debug("Keeping %s (part of wanted %s)", tmid, wmid)
        self.logger.info("Done scanning.")

    def clean(self, mode, folderName, minAge):
        """Trashes, archives, or deletes messages older than minAge days

        Arguments:
        mode -- the cleaning mode.  Valid modes are:
            trash -- moves the messages to a trash folder
            archive -- moves the messages to folders based on their date
            delete -- deletes the messages
        folderName -- the name of the folder on which to operate
            This is a name like "Stuff", not a filename
        minAge -- messages younger than minAge days are left alone
        """
        if not mode in ('trash', 'archive', 'delete'):
            raise ValueError

        keepThreads = self.keepFlaggedThreads or self.keepUnreadThreads
        if keepThreads:
            self.scanSubjects(folderName)

        archiveFolder = self.archiveFolder
        if (archiveFolder == None):
            if (folderName == 'INBOX'):
                archiveFolder = ""
            else:
                archiveFolder = folderName

        if (folderName == 'INBOX'):
            path = self.folderBase
        else:
            path = os.path.join(self.folderBase, self.folderPrefix + folderName)

        maildir = mailbox.Maildir(path, MaildirMessage)

        fakeMsg = ""
        if self.isTrialRun:
            fakeMsg = "(Not really) "

        # Move old messages
        for i, msg in enumerate(maildir):
            # Honour both -k and -u: keepMsgIds holds every message that
            # shares a thread with a flagged or unread message.
            if keepThreads and msg.getMessageId() in self.keepMsgIds:
                self.log(logging.DEBUG, "Keeping #%d (thread flagged or unread)" % i, msg)
            else:
                if (msg.getAge() >= minAge) and ((not self.keepRead) or msg.isNew()):
                    if mode == 'trash':
                        self.log(logging.INFO, "%sTrashing #%d (old)" %
                                 (fakeMsg, i), msg)
                        if not self.isTrialRun:
                            self.trashWriter.deliver(msg)
                            os.unlink(msg.fp._file.name)
                    elif mode == 'delete':
                        self.log(logging.INFO, "%sDeleting #%d (old)" %
                                 (fakeMsg, i), msg)
                        if not self.isTrialRun:
                            os.unlink(msg.fp._file.name)
                    else:  # mode == 'archive'
                        # Determine subfolder path
                        mdate = time.gmtime(msg.getDateSentOrRecd())
                        datePart = str(mdate[0])
                        if self.archiveHierDepth > 1:
                            datePart += self.folderSeperator \
                                + time.strftime("%m", mdate)
                        if self.archiveHierDepth > 2:
                            datePart += self.folderSeperator \
                                + time.strftime("%d", mdate)
                        subFolder = archiveFolder + self.folderSeperator \
                            + datePart
                        sfPath = os.path.join(self.folderBase,
                                              self.folderPrefix + subFolder)
                        self.log(logging.INFO, "%sArchiving #%d to %s" %
                                 (fakeMsg, i, subFolder), msg)
                        if not self.isTrialRun:
                            # Create the subfolder if needed
                            if not os.path.exists(sfPath):
                                mkMaildir(sfPath)
                            # Deliver
                            self.__mdWriter.deliver(msg, sfPath)
                            os.unlink(msg.fp._file.name)
                    self.stats[mode] += 1
                else:
                    self.log(logging.DEBUG, "Keeping #%d (fresh)" % i, msg)
            self.stats['total'] += 1

    def log(self, lvl, text, msgObj):
        """Log some text with the subject of a message"""
        subj = msgObj.getSubject()
        if subj == None:
            subj = "(no subject)"
        self.logger.log(lvl, text + ": " + subj)
# Defaults
minAge = 14
mode = None

logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logging.disable(logging.INFO - 1)
logger = logging.getLogger('cleanup-maildir')
cleaner = MaildirCleaner()

# Read command-line arguments
try:
    opts, args = getopt.getopt(sys.argv[1:],
            "hqvnrm:t:a:kud:",
            ["help", "quiet", "verbose", "version", "mode=", "trash-folder=",
             "age=", "keep-flagged-threads", "keep-unread-threads",
             "keep-read", "folder-seperator=", "folder-prefix=",
             "maildir-root=", "archive-folder=", "archive-hierarchy-depth=",
             "trial-run"])
except getopt.GetoptError, (msg, opt):
    logger.error("%s\n\n%s" % (msg, __doc__))
    sys.exit(2)

output = None
for o, a in opts:
    if o in ("-h", "--help"):
        print __doc__
        sys.exit()
    if o in ("-q", "--quiet"):
        logging.disable(logging.WARNING - 1)
    if o in ("-v", "--verbose"):
        logging.disable(logging.DEBUG - 1)
    if o == "--version":
        print __version__
        sys.exit()
    if o in ("-n", "--trial-run"):
        cleaner.isTrialRun = True
    if o in ("-m", "--mode"):
        logger.warning("the --mode flag is deprecated (see --help)")
        if a in ('trash', 'archive', 'delete'):
            mode = a
        else:
            logger.error("%s is not a valid command" % a)
            sys.exit(2)
    if o in ("-t", "--trash-folder"):
        cleaner.trashFolder = a
    if o == "--archive-folder":
        cleaner.archiveFolder = a
    if o in ("-a", "--age"):
        minAge = int(a)
    if o in ("-k", "--keep-flagged-threads"):
        cleaner.keepFlaggedThreads = True
    if o in ("-u", "--keep-unread-threads"):
        cleaner.keepUnreadThreads = True
    if o in ("-r", "--keep-read"):
        cleaner.keepRead = True
    if o == "--folder-seperator":
        cleaner.folderSeperator = a
    if o == "--folder-prefix":
        cleaner.folderPrefix = a
    if o == "--maildir-root":
        cleaner.folderBase = a
    if o in ("-d", "--archive-hierarchy-depth"):
        archiveHierDepth = int(a)
        if archiveHierDepth < 1 or archiveHierDepth > 3:
            sys.stderr.write("Error: archive hierarchy depth must be 1, " +
                             "2, or 3.\n")
            sys.exit(2)
        cleaner.archiveHierDepth = archiveHierDepth

if not cleaner.folderBase:
    cleaner.folderBase = os.path.join(os.environ["HOME"], "Maildir")
if mode == None:
    if len(args) < 1:
        logger.error("No command specified")
        sys.stderr.write(__doc__)
        sys.exit(2)
    mode = args.pop(0)
    if not mode in ('trash', 'archive', 'delete'):
        logger.error("%s is not a valid command" % mode)
        sys.exit(2)

if len(args) == 0:
    logger.error("No folder(s) specified")
    sys.stderr.write(__doc__)
    sys.exit(2)

logger.debug("Mode is " + mode)

# Clean each folder
for dir in args:
    logger.debug("Cleaning up %s..." % dir)
    cleaner.clean(mode, dir, minAge)

logger.info('Total messages:     %5d' % cleaner.stats['total'])
logger.info('Affected messages:  %5d' % cleaner.stats[mode])
logger.info('Untouched messages: %5d' %
            (cleaner.stats['total'] - cleaner.stats[mode]))
@quite

quite commented Oct 22, 2012

Line 477: remove "--"

@ehaupt

ehaupt commented Mar 19, 2017

I find this script very useful. I created a FreeBSD port of it a while ago. If you're the author, could you please add it to a repository and state under which license it is published?

@vassilit

vassilit commented Mar 4, 2018

Thank you for your script!

Are you sure we should generate a new timestamp for the filename when only moving the message from one folder to another subfolder?
132 dstName = "%d.%d_%d.%s" % (int(time.time()), os.getpid(), self.counter, socket.gethostname())

As per https://wiki2.dovecot.org/MailboxFormat/Maildir#Usage_of_timestamps and the quoted RFC, we could just os.rename() the message.
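
A minimal sketch of the rename-based move suggested here, written in Python 3 rather than the Python 2 of the script. It assumes the source and destination Maildirs are on the same filesystem and that the destination already has a `cur` directory. `move_message` is a hypothetical helper for illustration, not part of cleanup-maildir.py.

```python
import os


def move_message(src_path, dest_maildir):
    """Move one message file into dest_maildir/cur.

    The file keeps its original name (including any ':2,...' flag suffix),
    and os.rename does not modify the file's mtime.
    """
    dest_path = os.path.join(dest_maildir, "cur", os.path.basename(src_path))
    if os.path.exists(dest_path):
        # Refuse to overwrite a message that already has this name
        raise FileExistsError(dest_path)
    os.rename(src_path, dest_path)
    return dest_path
```

Compared with the link/unlink sequence in `MaildirWriter.deliver`, this is a single operation, but it fails across filesystem boundaries, so a fallback would still be needed for that case.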

@imurray

imurray commented Jan 4, 2019

I think the source of this script is: http://svn.houseofnate.net/unix-tools/trunk/cleanup-maildir (which now has a copyright/license on it).

@pkern
Author

pkern commented Nov 7, 2020

And my derived version of the script mostly adds threading support by modelling in-reply-to as a graph.
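
To illustrate the idea, here is a rough Python 3 sketch of the same thread grouping. It swaps the pygraph library used in the script for a plain adjacency dict and a breadth-first search, and it treats reply edges as undirected. The function names and the `(message_id, in_reply_to, references)` tuple layout are assumptions made for this example.

```python
from collections import defaultdict, deque


def build_reference_graph(messages):
    """Build an undirected graph linking each message to the IDs it cites.

    messages: iterable of (message_id, in_reply_to, references) tuples,
    where in_reply_to may be None and references is a list of IDs.
    """
    graph = defaultdict(set)
    for mid, in_reply_to, references in messages:
        graph[mid]  # make sure the node exists even without edges
        cited = list(references)
        if in_reply_to:
            cited.append(in_reply_to)
        for other in cited:
            graph[mid].add(other)
            graph[other].add(mid)
    return graph


def ids_to_keep(graph, wanted_ids):
    """Return every ID in the same thread as any wanted (flagged/unread) ID."""
    keep = set()
    queue = deque(wanted_ids)
    while queue:
        mid = queue.popleft()
        if mid in keep:
            continue
        keep.add(mid)
        queue.extend(graph.get(mid, ()))
    return keep
```

Because IDs from the References header become nodes too, two replies stay connected even when the message between them is stored in another folder.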

@egidux77

Hello,

Any idea how to go through the subfolders in Maildir, like .Folder1, .Folder2, ..., and move them to an archive folder keeping the same structure: Archive.2021.Folder1, Archive.2021.Folder2, etc.?
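
One possible starting point, as a Python 3 sketch rather than a tested change to the script: list the folders under the Maildir root, then build the target name with the year ahead of the folder name. Both helpers and their parameters are hypothetical. The script itself appends the date after the archive folder name, so getting `Archive.2021.Folder1` would also mean changing how `clean()` builds `subFolder`.

```python
import os


def list_mail_folders(maildir_root, prefix=".", separator=".", skip=("Archive",)):
    """Return folder names (without prefix) that look like Maildir folders.

    A directory counts as a mail folder if it starts with `prefix` and has
    a 'cur' subdirectory. Folders whose first component is in `skip` are
    left out, so the archive is not archived again.
    """
    folders = []
    for entry in sorted(os.listdir(maildir_root)):
        path = os.path.join(maildir_root, entry)
        if not entry.startswith(prefix):
            continue
        if not os.path.isdir(os.path.join(path, "cur")):
            continue
        name = entry[len(prefix):]
        if name.split(separator)[0] in skip:
            continue
        folders.append(name)
    return folders


def archive_target(folder_name, year, archive_base="Archive", separator="."):
    """Build a name such as 'Archive.2021.Folder1'."""
    return separator.join([archive_base, str(year), folder_name])
```

Each name returned by `list_mail_folders` could then be passed as FOLDERNAME to one run of the script, much like the shell wrapper at the top does.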
