@tdonohue
Last active October 3, 2016 14:31
Migrate Mailing Lists from SourceForge to GoogleGroups

Migration of Mailing Lists from SourceForge to GoogleGroups

References

Prerequisites

  • I performed this migration from an Ubuntu 14.04 VM. So, some instructions may be Ubuntu/Debian specific.
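  • Download the mbox_send.py script referenced above (the modified copy later on this page credits https://github.com/wojdyr/fityk/wiki/MigrationToGoogleGroups as its source) and ensure Python is installed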

Steps

Export archives in mbox format from SourceForge

  • SourceForge provides downloadable 'mbox' exports from: https://lists.sourceforge.net/mbox/[listname]
  • Larger archives may fail to download via a browser. In that situation, wget should still work
    • e.g. wget --user=[username] --ask-password https://lists.sourceforge.net/mbox/[listname]

Export this mbox archive for each SF mailing list.

Export list subscribers from SourceForge

  • A list of all subscribers to a single SourceForge mailing list can be found at: https://sourceforge.net/p/[project]/admin/mailman/[listname]/subscribers/display

Export this subscriber list for each SF mailing list.

Setup Google Apps SMTP Relay

Because Gmail's SMTP server will ALWAYS rewrite the From: field, we'll use our Google Apps SMTP Relay. If you don't have Google Apps, you could technically use any SMTP server. But you should avoid Gmail's SMTP server: using it makes every migrated email appear in Google Groups as if it came from a single email address / user account.

Here's another good reference on Google Apps SMTP Relay vs Gmail SMTP: https://support.google.com/a/answer/176600

  • Login as a Google Apps Admin
  • Go to "Apps" -> "Google Apps" -> "Gmail" -> "Advanced Settings"
  • Scroll down to find the "SMTP relay service" setting

In setting up my SMTP Relay, I gave it the following options:

  • Senders: "Any Addresses"
  • Authentication: "Only accept email from specific IP addresses" (Added my computer's IP)
  • Left SMTP Auth and TLS both turned off (couldn't get postfix to work right with either enabled)
  • WARNING: PLEASE BE SURE TO REMOVE THESE SETTINGS AFTER YOU HAVE FINISHED THE MIGRATION! You likely don't want anyone who is able to "spoof" your IP address to send emails via this SMTP Relay.

NOTE: Google Apps SMTP Relay has a sending limit of 10,000 messages per day per user account. So, if you have more than 10,000 messages in your old archives, you'll need to plan to send them in batches of 10K, either on separate days or via separate user accounts.

  • I chose to set up three "dummy" accounts just for the purpose of sending these emails, so that I was able to send up to 30K messages in a 24 hour period.
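
To plan around that limit, it can help to count the messages in each exported archive first. The following is only a minimal sketch: the function names are hypothetical, and the 10,000 figure is simply the per-account limit quoted above, so adjust it if your relay behaves differently.

```python
import mailbox

DAILY_LIMIT = 10000  # per-account relay limit quoted in the note above


def count_messages(mbox_path):
    """Count the messages in an mbox file using the standard library."""
    return len(mailbox.mbox(mbox_path))


def plan_batches(total_messages, accounts=1, daily_limit=DAILY_LIMIT):
    """Return (batches, days) needed to send total_messages.

    One batch is at most daily_limit messages; each sending account
    can send one batch per day.
    """
    # Ceiling division using integers only (works on Python 2 and 3)
    batches = -(-total_messages // daily_limit)
    days = -(-batches // accounts)
    return batches, days
```

For example, plan_batches(25000, accounts=3) gives three batches that fit in a single day, while the same archive with one account needs three days.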

Install postfix and configure to use Google Apps SMTP Relay

Why go through all this trouble? Well, again, it's all about the From: field. Using postfix + Google Apps SMTP Relay preserves the existing From: field, while using Gmail SMTP doesn't.

(By the way, I'm using Ubuntu 14.04. Your installation of postfix may be different)

  • sudo apt-get install postfix
  • Modify the basic setup to point at Google's SMTP Relay:
    • sudo nano /etc/postfix/main.cf
    • Update/add the following:
      • myhostname = mail.[YOUR-GOOGLE-APPS-DOMAIN] (Ensures postfix "acts" like your Google Apps domain)
      • relayhost = [smtp-relay.gmail.com]:587 (Google Apps SMTP relay host/port. Yes, there really should be square brackets in this value)
      • message_size_limit = 0 (Just in case you have large messages in your archives, you don't want postfix blocking them. 0 = unlimited size)
      • header_checks = regexp:/etc/postfix/header_checks (Lets you modify/cleanup email headers, if needed, before sending to Google Groups. OPTIONAL)
  • Create a /etc/postfix/header_checks file (see header_checks setting above). Here's mine:
   # Ignore local "Received" headers
   # In my case, my local VM reports as "vagrant.dev"
   /^Received: from vagrant\.dev/ IGNORE
   
   # These next two checks work together to clean up bad dates in old emails.
   # Some older emails (circa 2004) have an invalid "Date:" format, which
   # is immediately followed by a valid "X-Original-Date:". We're removing the 
   # invalid "Date:" and replacing it with "X-Original-Date:".

   # Ignore any old Dates without a timezone on the end.
   # These are invalid and are formatted like:
   #     Weds Sep 22 10:33:05 2004
   # Correct format is:
   #     Wed, 22 Sep 2004 10:33:05 +0200
   /^Date:.*[^+-][0-9]{4}$/ IGNORE

   # Rename any "X-Original-Date:" fields to be "Date:"
   # (As these are the correctly formatted Dates)
   /^X-Original-Date: (.*)$/ REPLACE Date: $1
  • Reload Postfix
    • sudo service postfix reload
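
Before sending anything, you may want to sanity-check the header_checks patterns against a few sample headers. The sketch below mimics the three rules with Python's re module. It is an approximation rather than Postfix itself: the apply_checks function is hypothetical, and re.IGNORECASE is used because Postfix regexp tables match case-insensitively by default.

```python
import re

# The same three patterns as in the header_checks file above.
LOCAL_RECEIVED = re.compile(r"^Received: from vagrant\.dev", re.IGNORECASE)
BAD_DATE = re.compile(r"^Date:.*[^+-][0-9]{4}$", re.IGNORECASE)
ORIGINAL_DATE = re.compile(r"^X-Original-Date: (.*)$", re.IGNORECASE)


def apply_checks(header_line):
    """Return the rewritten header line, or None if it would be dropped."""
    if LOCAL_RECEIVED.search(header_line):
        return None  # IGNORE: local "Received" header
    if BAD_DATE.search(header_line):
        return None  # IGNORE: Date without a timezone offset
    match = ORIGINAL_DATE.search(header_line)
    if match:
        return "Date: " + match.group(1)  # REPLACE with a proper Date header
    return header_line  # untouched


if __name__ == "__main__":
    samples = [
        "Date: Weds Sep 22 10:33:05 2004",
        "Date: Wed, 22 Sep 2004 10:33:05 +0200",
        "X-Original-Date: Wed, 22 Sep 2004 10:33:05 +0200",
    ]
    for line in samples:
        print(repr(line), "->", repr(apply_checks(line)))
```

The first sample is dropped because the four digits at the end are not preceded by a + or - sign, the second passes through unchanged, and the third is renamed to a Date: header.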

Create the new Google Group

Not much to be said here. Create it.

  • Make it private initially (as the migration of messages may take some trial & error)
  • Turn OFF emails for any initial admin accounts. Again, the migration may take some trial & error, and you don't want to spam yourself.
  • If you are using Google Apps SMTP Relay, depending on your settings, you may need to use an account under the Google Apps Domain to actually migrate the email messages. So, give that account access to this new Google Group and turn OFF email for that account as well.

Send the 'mbox' to the Google Group

Now, let's run the mbox_send.py script, using our locally running postfix (running on localhost:25):

python mbox_send.py --to=[google-group-email] --from=[from-email] [mbox-file]

There are plenty of other options also available in the mbox_send.py script.

As an example, here's the specific command I ran to populate the archives of the dspace-devel mailing list:

python ./mbox_send.py --to=dspace-devel@googlegroups.com --from=[from-email] --chunk=1 --pause=1 --count=10000 dspace-devel.mbox

A few notes on that example:

  • [from-email] was a dummy account I created in my Google Apps domain to send these emails from. This account MUST be a member of the Google Group you are sending to, and have posting privileges in that group.
  • chunk=1 specifies to only send one email at a time
  • pause=1 specifies to pause for one second between chunks
  • While you don't NEED to pause between each email, I found it gives Google Groups enough time to process emails in order. If you overwhelm Google Groups, some emails may end up slightly out of order. So, feel free to tweak or remove these settings if you don't care whether old emails are slightly out of order.
  • count=10000 specifies to stop after 10,000 messages. Because Google Apps SMTP Relay only allows sending 10K messages per day per user, it'll start blocking you shortly after you hit that limit. So, this setting just stops sending after 10K messages, so you can start there again the next day (or send the next 10K messages using a different user account).
    • When you are performing your FIRST migration, you may want to set this very low (e.g. count=10) just to see how things migrate into the Google Group. If something goes wrong, you can always delete the messages from the Google Group and try again.
    • KEEP IN MIND: This mbox_send.py script keeps track of the last message sent (in a *.hwm file). If you ever need to RESEND messages, you'll need to either modify the *.hwm file, or use the start=[message-number] parameter.
    • I did find that there were odd occasions where things seemed to "stall" on the Google Groups side of things. It's possible I was just impatient and needed to wait for it to continue processing. But, I did occasionally check in on the process to see if it stalled. If it did, I restarted the script at the last message that seemed to make it into the Google Group. SEE THE Stalled Email Migrations comment below for more info on my process.
    • Don't worry about sending messages multiple times. Google Groups does an excellent job of filtering out duplicates (by their Message-ID header). So, even if you send the same message 10 times, it should only appear once in GG.
  • If you are using postfix on localhost:25, you don't need to specify the smtpHost or smtpPort flags. Also, you can leave the prompted Password empty, as no password will be necessary.
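
If you need to resend from an earlier point, the *.hwm file can be inspected or rewound by hand. The sketch below assumes the layout suggested by the modified script later on this page: a file named [mbox-file].hwm that holds a single integer, the number of the last message sent. The helper names are hypothetical, so check them against your copy of mbox_send.py before relying on them.

```python
import os


def hwm_path(mbox_path):
    # Assumed naming convention: "<mbox file>.hwm" next to the mbox file
    return mbox_path + ".hwm"


def read_hwm(mbox_path):
    """Return the last message number recorded, or -1 if no file exists."""
    path = hwm_path(mbox_path)
    if not os.path.isfile(path):
        return -1
    with open(path, "rt") as f:
        return int(f.read().strip())


def rewind_hwm(mbox_path, last_sent):
    """Record last_sent so that the next run resumes after that message."""
    with open(hwm_path(mbox_path), "wt") as f:
        f.write(str(last_sent))
```

For example, rewind_hwm("dspace-devel.mbox", 499) would mark message 499 as the last one sent, which should have the same effect as passing start=499 on the next run.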

Migrating Subscribers

Unfortunately, Google Groups has a hard limit of only being able to directly add 100 subscribers to a group per day. No matter how many Owners/Managers you have assigned to the Google Group, among all of them, you can only add up to 100 subscribers per day. There's seemingly no way around it.

If you have <=100 subscribers, you can simply use the "Direct Add Members" tool to add up to 100 subscribers. You'll have to add them in batches of 10, since Google Groups only lets you add 10 at once (and prompts you with a captcha for each set).
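
To prepare the exported subscriber list for pasting into that tool, you could split it into groups ahead of time. This is a minimal sketch with hypothetical function names: it groups addresses into requests of 10 and caps each day at 100, matching the limits described above.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_subscriber_adds(addresses, per_request=10, per_day=100):
    """Return a list of days; each day is a list of requests (address lists)."""
    return [chunk(day, per_request) for day in chunk(addresses, per_day)]


if __name__ == "__main__":
    # Made-up addresses, purely for illustration
    subscribers = ["user%d@example.org" % n for n in range(250)]
    for day_number, requests in enumerate(plan_subscriber_adds(subscribers), 1):
        print("Day %d: %d requests" % (day_number, len(requests)))
```

With 250 addresses this yields three days: two full days of ten requests each, and a third day with five.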

If you have >100 subscribers, you have a few options:

  • Either add them over several days, 100 at a time. (But once you get into 1,000 subscriber territory you are talking 10+ days to add everyone)
  • OR, notify the existing list that they will need to re-subscribe to the new list.

I chose the latter option for our largest lists. I tried to make this easier on everyone by doing the following:

  • Temporarily configure the new Google Group with Join the group = "Anyone can ask"
  • A day or two in advance, notify the existing (old) list that individuals can submit a "Join Request" for the new Google Group. These "Join Requests" were held until the migration was complete.
  • Finish migrating the archives on the "switch-over" day (Make sure NOT to accept any "Join Requests" prior to the final switch over, otherwise you'll spam these users when you finish the migration of the archives.)
  • Once migration was complete, accept all "Join Requests" (adding those users immediately). Switch the setting of the group to allow "Public" to join.
@tdonohue
Author

Stalled Email Migrations

As mentioned under the "Send the 'mbox' to the Google Group" section above, there were times where Google Groups seemed to "stall". The messages were still being sent successfully by mbox_send.py, BUT Google Groups just stopped processing them (even though it wasn't throwing an error).

For those situations, I created a slightly modified version of the mbox_send.py script which I renamed mbox_findnum.py. This mbox_findnum.py script has all the send capabilities commented out. The whole purpose of the script is to FIND the message number of a message in the archives.

Essentially, my process was to check Google Groups to see what message was appearing at the TOP (or second or third to the top, if the top message(s) had a very generic subject line). Then I'd take the subject of that message and grep for it using this mbox_findnum.py script. For example, if I thought it stalled somewhere between the 1,000th and 2,000th message, I'd use:

python mbox_findnum.py --start=1000 --count=1000 [mbox-file] | grep "[Subject-of-last-successful-message]"

The result would be a message number from the mbox. Then you can pass that message number into the mbox_send.py as the new --start parameter to restart at that message.
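
The same lookup can also be done without grep. The sketch below is not the mbox_findnum.py script itself, only a hypothetical helper built on the standard mailbox module. It assumes message numbers are the zero-based positions that mailbox.mbox assigns when it opens the file, and here start is inclusive.

```python
import mailbox


def find_by_subject(mbox_path, needle, start=0, count=None):
    """Return (message_number, subject) pairs whose Subject contains needle.

    Only messages numbered start .. start+count-1 are examined;
    count=None means search through to the end of the mbox.
    """
    mbox = mailbox.mbox(mbox_path)
    total = len(mbox)
    stop = total if count is None else min(total, start + count)
    matches = []
    for number in range(start, stop):
        # A missing Subject header comes back as None
        subject = str(mbox[number]["Subject"] or "")
        if needle in subject:
            matches.append((number, subject))
    return matches
```

A call such as find_by_subject("dspace-devel.mbox", "[Subject-of-last-successful-message]", start=1000, count=1000) would cover roughly the same range as the grep example above.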

Here's my mbox_findnum.py modified script. It's very similar to the original script, but I commented out all the sending capabilities and removed those required, related params.

#!/usr/bin/env python

"""\
A command-line utility that can be used to determine the number of a message
based on its Subject. This is useful when something goes wrong or stalls
and you need to restart the `mbox_send.py` script at a specific point in time.
Using this script, you can determine the last successfully sent message's number
and pass it to `mbox_send.py` via the `start` parameter.
"""

# Based on the `mbox_send.py` script at
# https://github.com/wojdyr/fityk/wiki/MigrationToGoogleGroups

import sys
import os
import time
import mailbox
import email
import smtplib

from optparse import OptionParser, make_option
from getpass import getpass

# Set some defaults

defTo = []
defFrom = None
defChunkSize = 100
defChunkDelay = 1.0
defSmtpHost = 'localhost'
defSmtpPort = 25
defCount = -1
defStart = -1

# define the command line options

option_list = [
    make_option('--to', action='append', dest='toAddresses', default=defTo,
        help='The address to send the messages to. May be repeated.'),

    make_option('--from', dest='fromAddress', default=defFrom,
        help='The address to send the messages from.'),

    make_option('--chunk', type='int', dest='chunkSize', default=defChunkSize,
        help='How many messages to send in each batch before pausing, default: %d' % defChunkSize),

    make_option('--pause', type='float', dest='chunkDelay', default=defChunkDelay,
        help='How many seconds to delay between chunks. default: %f' % defChunkDelay),

    make_option('--count', type='int', dest='count', default=defCount,
        help='How many messages to send before exiting the tool, default is all messages in the mbox.'),

    make_option('--start', type='int', dest='start', default=defStart,
        help='Which message number to start with. Defaults to where the tool left off the last time, or zero.'),

    make_option('--smtpHost', dest='smtpHost', default=defSmtpHost,
        help='Hostname where SMTP server is running'),

    make_option('--smtpPort', type='int', dest='smtpPort', default=defSmtpPort,
        help='Port number to use for connecting to SMTP server'),
    ]

#---------------------------------------------------------------------------

def get_hwm(hwmfile):
    if not os.path.isfile(hwmfile):
        return -1
    with open(hwmfile, 'rt') as f:
        hwm = int(f.read().strip())
    return hwm

def set_hwm(hwmfile, count):
    with open(hwmfile, 'wt') as f:
        f.write(str(count))

def main(args):
    if sys.version_info < (2,5):
        print('Python 2.5 or better is required.')
        sys.exit(1)

    # Parse the command line args
    parser = OptionParser(
        usage='%prog [options] mbox_file(s)',
        description=__doc__,
        version='%prog 0.9.1',
        option_list=option_list)

    options, arguments = parser.parse_args(args)

    # ensure we have the required options
    #if not options.toAddresses:
    #    parser.error('At least one To address is required (use --to)')

    #if not options.fromAddress:
    #    parser.error('From address is required (use --from)')

    if not arguments:
        parser.error('At least one mbox file is required')

#    smtpPassword = getpass() # implies using TLS

    # process the mbox file(s)
    for mboxfile in arguments:
        print('Opening %s...' % mboxfile)
        mbox = mailbox.mbox(mboxfile)
        totalInMbox = len(mbox)
        print('Total messages in mbox: %d' % totalInMbox)

        #hwmfile = mboxfile + '.hwm'
        #print('Storing last message processed in %s' % hwmfile)
        #start = get_hwm(hwmfile)
        start = -1
        if options.start != -1:
            start = options.start
        start += 1
        print('Starting with message #%d' % start)

        totalSent = 0
        current = start

        # Outer loop continues until either the whole mbox or options.count
        # messages have been sent,
        while (current < totalInMbox and
            (totalSent < options.count or options.count == -1)):

            # Inner loop works one chunkSize number of messages at a time,
            # pausing and reconnecting to the SMTP server for each chunk.
            print('Connecting to SMTP(%s, %d)' % (options.smtpHost, options.smtpPort))
            #smtp = smtplib.SMTP(options.smtpHost, options.smtpPort)
            #if smtpPassword: # use TLS
            #    smtp.ehlo()
            #    smtp.starttls()
            #    smtp.ehlo()
            #    smtp.login(options.fromAddress, smtpPassword)

            chunkSent = 0
            while chunkSent < options.chunkSize:
                msg = mbox[current]
                print('Processing message %d: %s' % (current, msg['Subject']))

                try:
                    # Here is where we actually send the message
                    #smtp.sendmail(options.fromAddress, options.toAddresses, msg.as_string())

                    #set_hwm(hwmfile, current) # set new 'high water mark'
                    current += 1
                    totalSent += 1
                    chunkSent += 1
                    if (current >= totalInMbox or
                        (totalSent >= options.count and options.count != -1)):
                        break
                except smtplib.SMTPServerDisconnected as e:
                    print('Error: %s' % str(e))
                    #del smtp
                    print('Pausing for %f seconds...' % options.chunkDelay)
                    time.sleep(options.chunkDelay)
                    print('')
                    break
            else:
                #smtp.quit()
                #del smtp
                print('Pausing for %f seconds...' % options.chunkDelay)
                time.sleep(options.chunkDelay)
                print('')

    print('Goodbye')

#---------------------------------------------------------------------------

if __name__ == '__main__':
    main(sys.argv[1:])

@tdonohue
Author

A few final notes:
