Skip to content

Instantly share code, notes, and snippets.

@kadin2048
Last active April 7, 2022 11:50
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save kadin2048/ffe811e56c8e8fb6ceb8bade09439341 to your computer and use it in GitHub Desktop.
Save kadin2048/ffe811e56c8e8fb6ceb8bade09439341 to your computer and use it in GitHub Desktop.
Effort to document and describe a variety of (mostly obsolete) instant messaging log file formats.

Instant Messaging Client Log Formats

An effort to document and describe a variety of (mostly obsolete) instant messaging log file formats.

by Kadin2048 kadin@sdf.org

Last updated 2021-11-03

Introduction

Instant messaging—of one sort or another—has been a part of my life almost as long as email has. But unlike Internet email, which benefits from widely-accepted standardization (RFC 822, 2822, 5322, etc.) and open-source implementations, instant messaging has generally been the domain of relatively short-lived, proprietary products, each with their own wire protocols and on-disk storage formats.

In an effort to retain as much as possible of my instant messaging history over the years, I have spent a fair bit of time trying to understand the formats used by popular products to store their message logs, and writing converters to translate them into more readable formats for archival purposes.

This document is my attempt to sum up what I have learned over the years, in the hope that it might be of some interest and utility to others.

Summary

Client File Extension File Type Self Contained?
AOL Instant Messenger for Mac none Flat text Yes
CenterIM none Flat text No
IBM Sametime Connect .html HTML Document No
Pidgin .html HTML Document Yes
Adium (Old Style) .AdiumHTMLLog HTML Fragment Yes (with filename)
Adium (ULF) .chatlog XML Document Yes
iChat .chat typedstream data Yes (?)
Messages (file-based storage) .ichat Binary PList Yes
Messages (chat.db storage) .db SQLite Database Yes

AOL Instant Messenger for Mac

At the height of its popularity in the late 90s and early 00s, AOL offered official client applications for AOL IM on several platforms, including Classic (pre-OS X) MacOS. My earliest IM logs are from this software and are still easily readable today, owing to the relatively simple plain-text format.

Although I do not have a working Classic Mac system to run the software and check, I believe that logs were saved to a folder inside Macintosh HD:System Folder:Preferences, and were grouped into subfolders by local account name.

The files are named according to the pattern [Screen Name] IM Log where [Screen Name] is the remote party's AOL IM screen name. Note that there is no file extension as part of the name, as extensions were not typically used on Classic MacOS (in favor of filesystem Creator and Type metadata, stored in the file's Resource Fork and often lost over time if the files were ever copied to a non-HFS filesystem).

File type and creator were set to TXEX/txtt on the oldest HFS backups I was able to locate and analyze.

Each logfile contains multiple conversations with the same remote screenname. The maximum size of a log file is uncertain; none of mine are larger than about 500kB. The text encoding used is also uncertain. Assuming the files have not been converted or re-saved using a editor that silently converts them, line separators are 0x0D (Carriage Return) rather than the now-common Unix-style 0x0A (Line Feed) or the Windows-standard 0x0D 0x0A (CR/LF). Many logfiles I've found contain odd high-ASCII values, but they are not clearly in any particular encoding.

Each conversation within the file begins with a date and time stamp (usually formatted either as "MM/DD/YY hh:MM AP" or "mm/DD/YY HH MM") and a Carriage Return (0x0D).

Each message within the conversation begins on a new line and starts with the AOL screen name of the party sending the message, followed by a colon (0x3A), a tab (0x09), and then the message, terminated by a 0x0D. However, messages can span multiple lines (by virtue of containing Carriage Return characters), and copying and pasting messages (something I did often, apparently) can result in what are effectively nested messages.

The following is a legal chat log containing several edge cases, shown with the following substitutions for display purposes:

  • [CR] for 0x0D (ASCII Carriage Return)
  • [SOH] for 0x01 (ASCII Start of Header, appears to be used as a placeholder character, perhaps for graphical smileys?)
  • [C2] for 0xC2 (Unknown Character; represents Unicode ¬ NOT SIGN U+00AC in MacRoman charset)
  • [A7] for 0xA7 (Unknown Character; represents Unicode ß LATIN SMALL LETTER SHARP S U+00DF in MacRoman charset)
  • [EOF] for the logical end of the file
  • Horizontal Tab (0x09) characters are left unconverted, since they are the same in UTF-8

"TheirAIMName IM Log":

1/18/2000 22 58[CR]
MyAIMName:	this is a legal message:[CR]
MyAIMName:	[SOH][CR]
MyAIMName:	and so is the following one[CR]
MyAIMName:	we can be [SOH] and [SOH] ...[CR]
TheirAIMName:	anyway i need to go[CR]
TheirAIMName:	k[CR]
MyAIMName:	back[CR]
[CR]
Auto response from TheirAIMName:[CR]
TheirAIMName:	i'm checking my negatives right now.. be right back[CR]
[CR]
MyAIMName:	k[CR]
TheirAIMName:	i'm back[CR]
[CR]
7/19/00 19 08[CR]
TheirAIMName:	hello[CR]
MyAIMName:	hi[CR]
MyAIMName:	what's goin on>[CR]
MyAIMName:	?[CR]
TheirAIMName:	ah[CR]
TheirAIMName:	not much[CR]
MyAIMName:	and what's your nick?[CR]
[CR]
TheirAIMName:	OldName82[CR]
TheirAIMName:	 Hunter [C2][A7]t@rcing.Boxers4KI 01:29:35[CR]
Bet shes gonna shella alotta dough for that operation[CR]
MyAIMName:	lol[EOF]

Note that the files do not generally end with a newline.

CenterIM

Official page: https://github.com/petrpavlu/centerim5

CenterIM is a text-mode, multi-protocol instant messaging client using libpurple on the back end. This means it supports all the same protocols that other libpurple-based clients, such as Pidgin, do. However, its logs are written in a different format than Pidgin's.

CenterIM's logs are stored inside the ~/.centerim directory as text files (seemingly ASCII, but other character sets may be supported) inside specially-named subdirectories.

A directory is created for each chat partner, prefixed with a letter indicating the protocol in use. E.g. an AOL Instant Messenger chat with a user named "joecool" would be stored in a directory named "ajoecool", with the prefix "a" indicating AOL IM.

Protocol Directory Prefix
AOL Instant Messenger (AOLIM) a
Microsoft Messenger (MSN) m
Jabber / Google Chat j

Inside each directory is a file named "history", holding the actual messages.

Each message begins with an ASCII form feed character (hex 0C) on a line by itself, followed by the string "IN" or "OUT" depending on whether the message is incoming or outgoing (from the perspective of the client program), the string "MSG" on a line by itself, then two Unix timestamps (seemingly always the same; I'm not sure why it's repeated) on lines by themselves, and then the message text on the last line.

Example (with ASCII Form Feeds represented as "[FF]"):

[FF]
IN
MSG
1196779655
1196779655
hey there! thanks for the invite, but my company holiday party is that day, so I can't make it. Next time?
[FF]
OUT
MSG
1196782478
1196782478
hey no problem, just wanted to let you know you're invited if you were in town

Note that nowhere in the actual log file are either the near- or far-side account names given. The far-side account name can be determined from the name of the enclosing directory, but the near-side name has to be supplied by the user in some other fashion.

CenterIM to XML ULF in Python

Sometime in 2007, I wrote a very basic CenterIM flat file log parser in Python, in order to try and condense all my log files into the XML-based ULF format used at the time by Adium on Mac OS X. It is described here: http://kadin.sdf-us.org/blog/technology/centerim-converter.html

CenterIM to XML ULF in Java

Several years later in 2009, I re-wrote the converter in Java, using an actual XML library to create the output document structure.

It can be found at: https://gist.github.com/kadin2048/1e6a1f1204b56d08e6612f9b33dc44f7

The following code snippet shows the actual message-parsing loop:

while ( (line = brSourceFile.readLine()) != null ) {
    if ( line.indexOf("\f") != -1 ) {
        // If we're looking at a formfeed, skip it by re-running the loop
        continue;
    }
    xmlout.writeEntity("message");
    msgs++; // Increment the message counter
    
    String direction = line; // First real line should be "IN" or "OUT"
    xmlout.writeAttribute("direction", direction);
    
    if (direction.indexOf("IN") != -1) {
    	// If it's an incoming message...
    	xmlout.writeAttribute("sender", farEndName);
    } else {
    	// If it's an outgoing message...
    	xmlout.writeAttribute("sender", sNearEndName);
    }
    
    brSourceFile.readLine(); // Then is the string "MSG", we skip it
    
    String timestamp = brSourceFile.readLine(); // Next should be timestamp
    if ( !bNoUnixDate.getValue() ) {
    	xmlout.writeAttribute("unixtime", timestamp);  // Write it unconverted
    }
    unixdate = Integer.parseInt(timestamp);
    javadate = (long) unixdate * 1000; // Java uses ms, Unix uses secs
    msgdate = new Date(javadate); // Convert the long to a date object
    xmlout.writeAttribute("time", df.format(msgdate));
    
    brSourceFile.readLine(); // Then skip the second, redundant timestamp
    String message = brSourceFile.readLine(); // Then read the message
    xmlout.writeText(message); // write the message out
    
    xmlout.endEntity(); // close the </message>
}

IBM Sametime Connect (HTML)

I used Sametime Connect, probably version 7.x or so, on Windows XP at work circa 2005-2006; this information is probably very dated and specific to versions of Sametime Connect from that era. It also does not apply to the Sametime client application embedded in Lotus Notes. Unlike some other clients, Sametime Connect does not log by default; the feature has to be explicitly enabled by the user (at least in my installation).

As of 2021, available public documentation from IBM and HCL Software (the current owner of Sametime) does not specify the storage location of log files. It is presumably somewhere in Sametime Connect's \Application Data\ directory. Within the logs directory is a personfolders.xml file, and subdirectories for each conversation partner, named according to their Sametime/Notes "contactID" (typically an email address). Inside of each of these directories is a series of subdirectories named by date as YYYYMMDD, and a file named ChatHistory.properties. These by-date subdirectories hold the conversation logs, one HTML file per day (so, one per directory). The HTML files are named using the user portion of their contactID (for "joecool@somecompany.example" this would be "joecool") followed by "log.html" (e.g. producing "joecoollog.html").

[Sametime Logs Folder]
  personfolders.xml
  [contactID]
    ChatHistory.xml
    [YYYYMMDD]
      [contactIDuser]log.html

Metadata Sidecar Files

Neither the personfolders.xml or ChatHistory.properties files are needed for reconstruction of usable chat logs, although they both contain metadata that may be of interest to the archivist.

"personfolders.xml":

<?xml version="1.0" encoding="UTF-8"?>
<folders>
  <folder communityId="Sametime_0000000003593.0000000037" displayName="Joe Cool" folderPath="joe.b.cool@us.bigco.example" id="joe.b.cool@us.bigco.example" isExternal="false"/>
  [<folder ... /> elements repeat]
</folders>

Note that the displayName attribute is not always filled-in; on many of my folders the attribute is null.

"ChatHistory.properties":

#Chat History Properties
#Mon Aug 14 13:24:35 EDT 2006
serverID=0 09118812

My version of Sametime Connect produced complete, well-formed HTML documents, one per chat partner per day. However, similar to CenterIM and some other clients, the far-side username is not explicitly provided in the log itself. Although a <meta> element in the document head does provide an attribute sametime:initiator, this does not seem to be guaranteed to be the remote-side party's username (although in many instances it is, particularly if the first message in that day's logs is from the remote side).

One way to reliably get the remote username is to consult the folder hierarchy in which the logfile is found. Typically the grandparent directory of the log file will be named using the contactID.

A parser written in Python, meant to convert Sametime HTML logs into RFC-compliant MIME multipart/mixed email messages (for archiving in an IMAP mailbox) can be found at:
https://gist.github.com/kadin2048/6d77ab7471590eedcc65

Pidgin

Official page: https://www.pidgin.im/

Pidgin, originally known as Gaim, is a multi-protocol IM client for Linux and Windows. I used it for a number of years from the late 2000s into the early 2010s for GTalk/Jabber and AOL IM. Pidgin logs are well-formed HTML and include the far-side (conversation partner's) account name inside the HTML title element.

A converter from Pidgin to Adium log format (written in Ruby) is available on Github:
https://github.com/gabebw/pidgin2adium

My own Python script, written in 2011 to convert from Pidgin to RFC-compliant .eml message files by wrapping them in the appropriate headers and transport encoding:
https://gist.github.com/kadin2048/ad0feef9f4f4230cd207907eceb17452

Adium

Official development site: https://github.com/adium/adium

Official Chat Transcript Viewer documentation: https://adium.im/help/pgs/Messaging-TranscriptViewer.html

Adium was the last desktop IM client program I used, right up until Google blocked third-party clients (a dark day). It was one of my longest-used and favorite instant messaging client programs.

Adium logs are usually stored in "~/Library/Application Support/Adium 2.0/Users/Default/Logs", although this may vary if you had multiple users configured (uncommon). Inside the Logs folder are subfolders for each local messaging account, named according to the protocol/service, a dot, and then the account username. Inside each account subfolder are further subfolders for each remote conversation partner, by username. Actual logs are stored inside, one per conversation.

Adium Folder Hierarchy:

~/Library/Application Support/Adium 2.0/Users/Default/Logs/
  <protocol>.<local_username>
    <remote_username>
      <remote_username> (<datestamp>).<extension>

Adium used two distinct log file formats over its lifespan. The first was an HTML-based format saved with the file extension ".AdiumHTMLLog", and then the later was a semi-standardized, XML-based format with the extension ".chatlog". This may have corresponded to Adium 1.x versus 2.x, but I am not sure.

Adium HTML Logs (.AdiumHTMLLog - Fragmentary Format)

Adium HTML-based logs with extension ".AdiumHTMLLog" are not well-formed HTML, but rather a series of <div> and <span> elements, with each message on one line. The file can be mated to a header and footer to form a complete HTML document for display, with the file contents inserted inside the HTML <body> element.

Example, showing a short conversation between users "someguy" (remote) and "myusername" (local):

<div class="receive"><span class="timestamp">12:41:38</span> <span class="sender">someguy: </span><pre class="message">hey</pre></div>
<div class="send"><span class="timestamp">12:45:02</span> <span class="sender">myusername: </span><pre class="message">whats up?</pre></div>
<div class="receive"><span class="timestamp">12:45:06</span> <span class="sender">someguy: </span><pre class="message">sorry</pre></div>
<div class="send"><span class="timestamp">12:46:08</span> <span class="sender">myusername: </span><pre class="message">oh, no problem</pre></div>
<div class="receive"><span class="timestamp">12:46:58</span> <span class="sender">someguy: </span><pre class="message">no biggie</pre></div>

Files do not contain any headers, but begin with the first <div>, and end with a newline.

Adium XML Logs (.chatlog - Unified Logging Format)

Adium switched to XML-based logs referred to as the "Unified Logging Format" (ULF) sometime around May 2005. Despite the name, as far as I know the format has only ever been used by Adium and never achieved widespread adoption. This is a bit unfortunate, as it's a very nice format!

The ULF logs use the extension ".chatlog" and begin with an XML declaration, specifying XML 1.0 and UTF-8. This is followed by the top-level <chat> element, which can contain one or more <message> elements, featuring a number of possible attributes, and containing the HTML-formatted message text.

Example of a ULF log containing two messages (indents and extra LFs added for clarity):

<?xml version="1.0" encoding="UTF-8" ?>
<chat xmlns="http://purl.org/net/ulf/ns/0.4-02" account="someguy" service="AIM">
  <message sender="someguy" time="2005-05-29T12:06:51-05:00">
    <div>
      <span style="background-color: #ffffff; color: #000093; font-family: Verdana; font-size: 10pt;">hey</span>
    </div>
  </message>
  <message sender="myusername" time="2005-05-29T12:07:13-05:00">
    <div>
      <span style="background-color: #acb5bf; color: #000000; font-family: Verdana; font-size: 10pt;">whats up?</span>
    </div>
  </message>
</chat>

Quite a few people have made conversion tools and scripts that handle Adium logs over the years. To get HTML, the most elegant method is with an XML stylesheet and libxslt, which is supplied with OS X. An example stylesheet can be found at:
https://matthieu.yiptong.ca/2009/12/18/converting-adium-xml-chat-logs-to-html-format/

The neat-looking "Log2Log" program, which appears to have been abandoned, also supported it:
https://log2log.sourceforge.io/

When I decided to begin migrating my old chat logs into my Gmail account for easier search and archiving, I created a Python 2 script to process each AdiumHTMLLog and ULF XML file into an RFC2822-compliant MIME multipart/mixed .eml file, with the message body containing the conversation contents as a complete HTML document (with a header and footer attached to either end where needed). It is definitely not elegant, but it worked for me. This can be found on Github:
https://gist.github.com/kadin2048/8db8767686dfe93fe045

Apple iChat

iChat was Apple's own instant messaging client and was supplied as a free application bundled with Mac OS X. It supported a number of protocols, including AOL IM and peer-to-peer messaging using the Rendezvous service discovery mechanism (particularly useful for direct file transfer over local networks).

On my machine, iChat logs seem to be stored in "~/Documents/iChats", rather than in the Library, and have the extension ".chat".

The iChat log format is binary, and as far as I know not officially documented. The Mac OS "file" program reports that one created in 2005 is of type:

NeXT/Apple typedstream data, big endian, version 4, system 1000

This StackOverflow question confirms that these files are not binary plists (as used by the ".ichat" files produced by the newer Messages application), but an older NeXT serialization format called typedstreams. A Python library exists to decode them: https://causlayer.orgs.hk/dgelessus/python-typedstream

Using "pytypedstream" (part of dgelessus' typedstream-0.0.1.dev0, see above), a ".chat" file from Apr 2005 can be decoded into:

type b'@': NSArray, 8 elements:
	NSString('AIM')
	NSString('')
	NSMutableArray, 2 elements:
		object of class InstantMessage v0, extends NSObject v0, contents:
			type b'@': object of class Presentity v0, extends Person v0, extends NSObject v0, contents:
				type b'@': NSString('AIM')
				type b'@': NSString('myusername')
			type b'@': <NSDate: 2005-04-18 00:22:24.936263+00:00>
			type b'@': object of class NSAttributedString v0, extends NSObject v0, contents:
				type b'@': NSString('hey')
				group:
					type b'i': 1
					type b'I': 3
				type b'@': NSDictionary, 3 entries:
					NSString('NSColor'): <NSColor CALIBRATED_RGBA: 0.0, 0.0, 0.0, 1.0>
					NSString('NSBackgroundColor'): <NSColor CALIBRATED_RGBA: 0.6745098233222961, 0.7098039388656616, 0.7490196228027344, 1.0>
					NSString('NSFont'): NSFont(name='Verdana', size=10.0, flags_unknown=(0x00, 0x01, 0x00, 0x00))
			type b'I': 5
		object of class InstantMessage v0, extends NSObject v0, contents:
			type b'@': None
			type b'@': <NSDate: 2005-04-18 00:30:36.754579+00:00>
			type b'@': object of class NSAttributedString v0, extends NSObject v0, contents:
				type b'@': NSString('You left the chat.')
				group:
					type b'i': 1
					type b'I': 18
				type b'@': NSDictionary, 1 entry:
					NSString('NSFont'): NSFont(name='Verdana', size=11.0, flags_unknown=(0x00, 0x01, 0x00, 0x00))
			type b'I': 1
	NSMutableArray, 1 element:
		object of class Presentity v0, extends Person v0, extends NSObject v0, contents:
			type b'@': NSString('AIM')
			type b'@': NSString('theirusername')
	NSNumber, type b'c': 0
	NSNumber, type b'i': 2
	NSString('')
	NSString('')

Logorrhea

The OS X application Logorrhea, last updated in 2006, was created to parse and view iChat logs and is able to convert them out of the binary formats: http://spiny.com/logorrhea/

The actual parsing logic, written in Objective C++ and taken from the file "Chat.mm", appears to be:

- (void) loadContents
{
	if (chatContents == nil)
	{
		NSData *chatLog = [[NSData alloc] initWithContentsOfMappedFile:myPath];
		if ([myPath hasSuffix:@".ichat"]) // check for tiger-style chat transcript
		{
			NS_DURING
				chatContents = [[NSKeyedUnarchiver unarchiveObjectWithData:chatLog] retain];
			NS_HANDLER
				NSLog(@"Caught exception from NSKeyedUnarchiver - %@", [localException reason]);
				chatContents = nil;
			NS_ENDHANDLER
			[chatLog release];
		}
		else
		{
			NS_DURING
				chatContents = [[NSUnarchiver unarchiveObjectWithData:chatLog] retain];
			NS_HANDLER
				NSLog(@"Caught exception from NSUnarchiver - %@", [localException reason]);
				chatContents = nil;
			NS_ENDHANDLER
			[chatLog release];
		}
	
		if (![chatContents isKindOfClass:[NSArray class]])
		{
			[chatContents release];
			chatContents = nil;
		}

		if (chatContents != nil)
		{
			for (unsigned int i=0; i < [chatContents count]; i++)
			{
				id obj = [chatContents objectAtIndex:i];
				if ([obj isKindOfClass:[NSArray class]])
				{
					instantMessages = [obj retain];
					break;
				}
			}
		}
	}
}

iChat to Adium ULF Converter

Despite the complexity of the iChat log format, a converter was supplied with Adium to convert iChat logs into Adium's format. (As a result, all my chat logs from the period I used iChat ended up incorporated into my Adium logs archive, as ULF XML files.)

From the file InstantMessage.m:

- (id)initWithCoder:(NSCoder *)decoder
{
	if ([decoder allowsKeyedCoding])
	{
		sender = [[decoder decodeObjectForKey:@"Sender"] retain];
		text = [[decoder decodeObjectForKey:@"MessageText"] retain];
		date = [[decoder decodeObjectForKey:@"Time"] retain];
		flags = [decoder decodeInt32ForKey:@"Flags"];
	}
	else
	{
		sender = [[decoder decodeObject] retain];
		date = [[decoder decodeObject] retain];
		text = [[decoder decodeObject] retain];
		[decoder decodeValueOfObjCType:@encode(unsigned) at:&flags];
	}

	return self;
}

Apple Messages

Storage Locations

Messages is the modern descendent of iChat, but designed primarily for Apple's proprietary iMessage service (including SMS messages sent via an iPhone) and Jabber.

The Messages application stores logged messages in two places:

  • An SQLite database named chat.db, stored in "~/Library/Messages/"
  • Individual .ichat files stored (at least with Messages v11.0, Mac OS 10.13.6) in ~/Library/Containers/com.apple.iChat/Data/Library/Messages/Archive/[YYYY-MM-DD]/ where [YYYY-MM-DD] are a series of folders named according to the date in that format.

Explanations seem to vary as to why messages end up logged in one place versus the other. It seems to be the case that the SQLite database is what's synchronized when using iCloud sync for Messages, so this is presumably the canonical message repository.

This raises the question of why, then, do the ".ichat" files inside the com.apple.ichat hierarchy exist at all? I have yet to find an adequate answer. But at least on my Mac OS 10.13.6 machine, the directories seem to contain all conversations (not just closed conversations) from mid-2018 to the present day.

Binary PList File Format

Despite the file extension of ".ichat", Messages uses a different file format for its logs than iChat does. Unlike iChat's, which are based on NeXT's legacy typedstreams, Messages' version of ".ichat" uses the Apple-created Binary Property List (PList) format.

An easy test to determine whether a file uses the old or new format is to attempt to parse it with the Apple-supplied plutil program. Parsing an old textstream log file will yield:

$ plutil -convert xml1 testaccount.chat
testaccount.chat: Property List error: Unexpected character  at line 1 / JSON error: JSON text did not start with array or object and option to allow fragments not set.

In contrast, parsing a newer PList log will result in no output, and the in-place translation of the binary PList file into an XML-based PList. (You may notice the file suddenly increase in size in the Finder, due to the XML overhead.)

An example of a short chat log, converted to XML plist format, is provided as an attachment to this document.

Files using the more-modern NSKeyedUnarchiver format should be parsable with Python, using https://pypi.org/project/NSKeyedUnArchiver/.

An option for Python 2.x is potentially the ccl-bplist project: https://github.com/cclgroupltd/ccl-bplist (last updated 2016).

SQLite Chat Database

The preferred storage location for Messages chat logs seems to be in the SQLite database stored at "~/Library/Messages/chat.db". The schema of this database has been widely studied and reported on, although some note that it does seem to change over time (generally in a backwards-compatible way).

Instructions for loading a chat.db into PANDAS with Python 3 is available at: https://towardsdatascience.com/heres-how-you-can-access-your-entire-imessage-history-on-your-mac-f8878276c6e9

For posterity:

import sqlite3
import pandas as pd
conn = sqlite3.connect('/path/to/backup/of/chat.db')
# connect to the database
cur = conn.cursor()
# get the names of the tables in the database
cur.execute(" select name from sqlite_master where type = 'table' ") for name in cur.fetchall():
    print(name)
# get the handles to apple-id mapping table
handles = pd.read_sql_query("select * from handle", conn)
# and join to the messages, on handle_id
messages.rename(columns={'ROWID' : 'message_id'}, inplace = True)
handles.rename(columns={'id' : 'phone_number', 'ROWID': 'handle_id'}, inplace = True)
merge_leve_1 = temp = pd.merge(messages[['text', 'handle_id', 'date','is_sent', 'message_id']],  handles[['handle_id', 'phone_number']], on ='handle_id', how='left')
# get the chat to message mapping
chat_message_joins = pd.read_sql_query("select * from chat_message_join", conn)
# and join back to the merge_level_1 table
df_messages = pd.merge(merge_level_1, chat_message_joins[['chat_id', 'message_id']], on = 'message_id', how='left')

Some useful notes about the date column are also provided:

In Mac OS X versions before High Sierra (which is version 10.13 and released in September 2017), the date column is an epoch type but, unlike the standard of counting the seconds from 1970–01–01, it is counting the seconds from 2001–01–01. [...] In Mac OS X High Sierra and above, it’s the same thing but the date format is now much more granular: it is in nano-second level. So now we need to divide by 1,000,000,000

A notebook implementing the code above is available at: https://github.com/yortos/imessage-analysis/blob/master/imessages-data-extract-and-prep.ipynb

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>$archiver</key>
<string>NSKeyedArchiver</string>
<key>$objects</key>
<array>
<string>$null</string>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>35</integer>
</dict>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>2</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>3</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>4</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>33</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>3</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>34</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>3</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>3</integer>
</dict>
</array>
</dict>
<string>iMessage</string>
<string></string>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>32</integer>
</dict>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>5</integer>
</dict>
</array>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>31</integer>
</dict>
<key>BaseWritingDirection</key>
<integer>-1</integer>
<key>Error</key>
<dict>
<key>CF$UID</key>
<integer>0</integer>
</dict>
<key>Flags</key>
<integer>1085445</integer>
<key>GUID</key>
<dict>
<key>CF$UID</key>
<integer>30</integer>
</dict>
<key>IsInvitation</key>
<false/>
<key>IsRead</key>
<dict>
<key>CF$UID</key>
<integer>14</integer>
</dict>
<key>MessageText</key>
<dict>
<key>CF$UID</key>
<integer>17</integer>
</dict>
<key>OriginalMessage</key>
<dict>
<key>CF$UID</key>
<integer>29</integer>
</dict>
<key>Sender</key>
<dict>
<key>CF$UID</key>
<integer>6</integer>
</dict>
<key>Subject</key>
<dict>
<key>CF$UID</key>
<integer>12</integer>
</dict>
<key>Time</key>
<dict>
<key>CF$UID</key>
<integer>15</integer>
</dict>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>11</integer>
</dict>
<key>AccountID</key>
<dict>
<key>CF$UID</key>
<integer>7</integer>
</dict>
<key>AnonymousKey</key>
<false/>
<key>ID</key>
<dict>
<key>CF$UID</key>
<integer>9</integer>
</dict>
<key>ServiceLoginID</key>
<dict>
<key>CF$UID</key>
<integer>8</integer>
</dict>
<key>ServiceName</key>
<dict>
<key>CF$UID</key>
<integer>2</integer>
</dict>
</dict>
<string>FF04AB6C-5824-44D1-88AD-13AFF196A15A</string>
<string>E:myappleid@mac.com</string>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>10</integer>
</dict>
<key>NS.string</key>
<string>e:myappleid@mac.com</string>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSMutableString</string>
<string>NSString</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSMutableString</string>
</dict>
<dict>
<key>$classes</key>
<array>
<string>Presentity</string>
<string>IMHandle</string>
<string>IMDirectlyObservableObject</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>Presentity</string>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>11</integer>
</dict>
<key>AccountID</key>
<dict>
<key>CF$UID</key>
<integer>7</integer>
</dict>
<key>AnonymousKey</key>
<false/>
<key>ID</key>
<dict>
<key>CF$UID</key>
<integer>13</integer>
</dict>
<key>ServiceLoginID</key>
<dict>
<key>CF$UID</key>
<integer>8</integer>
</dict>
<key>ServiceName</key>
<dict>
<key>CF$UID</key>
<integer>2</integer>
</dict>
</dict>
<string>+18008675309</string>
<false/>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>16</integer>
</dict>
<key>NS.time</key>
<real>547529201</real>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSDate</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSDate</string>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>28</integer>
</dict>
<key>NSAttributes</key>
<dict>
<key>CF$UID</key>
<integer>19</integer>
</dict>
<key>NSString</key>
<dict>
<key>CF$UID</key>
<integer>18</integer>
</dict>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>10</integer>
</dict>
<key>NS.string</key>
<string>this is the new file format 🤨</string>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>27</integer>
</dict>
<key>NS.keys</key>
<array>
<dict>
<key>CF$UID</key>
<integer>20</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>21</integer>
</dict>
</array>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>22</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>25</integer>
</dict>
</array>
</dict>
<string>NSFont</string>
<string>NSParagraphStyle</string>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>24</integer>
</dict>
<key>NSName</key>
<dict>
<key>CF$UID</key>
<integer>23</integer>
</dict>
<key>NSSize</key>
<real>12</real>
<key>NSfFlags</key>
<integer>272</integer>
</dict>
<string>Helvetica</string>
<dict>
<key>$classes</key>
<array>
<string>NSFont</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSFont</string>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>26</integer>
</dict>
<key>NSAlignment</key>
<integer>4</integer>
<key>NSTabStops</key>
<dict>
<key>CF$UID</key>
<integer>0</integer>
</dict>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSMutableParagraphStyle</string>
<string>NSParagraphStyle</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSMutableParagraphStyle</string>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSDictionary</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSDictionary</string>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSMutableAttributedString</string>
<string>NSAttributedString</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSMutableAttributedString</string>
</dict>
<string>this is the new file format 🤨</string>
<string>F1E42932-1289-4655-A61D-7DBD8150702A</string>
<dict>
<key>$classes</key>
<array>
<string>InstantMessage</string>
<string>IMMessage</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>InstantMessage</string>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSMutableArray</string>
<string>NSArray</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSMutableArray</string>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>32</integer>
</dict>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>12</integer>
</dict>
</array>
</dict>
<integer>2</integer>
<dict>
<key>$classes</key>
<array>
<string>NSArray</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSArray</string>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>50</integer>
</dict>
<key>NS.keys</key>
<array>
<dict>
<key>CF$UID</key>
<integer>37</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>38</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>39</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>40</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>41</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>42</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>43</integer>
</dict>
</array>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>15</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>44</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>3</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>2</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>45</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>15</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>48</integer>
</dict>
</array>
</dict>
<string>StartTime</string>
<string>LastMessageID</string>
<string>ChatRoom</string>
<string>Service</string>
<string>Participants</string>
<string>EndTime</string>
<string>PresentityIDs</string>
<integer>1</integer>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>32</integer>
</dict>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>46</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>47</integer>
</dict>
</array>
</dict>
<string>Myfirstname Mylastname</string>
<string>Theirfirstname Theirlastname</string>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>32</integer>
</dict>
<key>NS.objects</key>
<array>
<dict>
<key>CF$UID</key>
<integer>9</integer>
</dict>
<dict>
<key>CF$UID</key>
<integer>49</integer>
</dict>
</array>
</dict>
<dict>
<key>$class</key>
<dict>
<key>CF$UID</key>
<integer>10</integer>
</dict>
<key>NS.string</key>
<string>18008675309</string>
</dict>
<dict>
<key>$classes</key>
<array>
<string>NSMutableDictionary</string>
<string>NSDictionary</string>
<string>NSObject</string>
</array>
<key>$classname</key>
<string>NSMutableDictionary</string>
</dict>
</array>
<key>$top</key>
<dict>
<key>metadata</key>
<dict>
<key>CF$UID</key>
<integer>36</integer>
</dict>
<key>root</key>
<dict>
<key>CF$UID</key>
<integer>1</integer>
</dict>
</dict>
<key>$version</key>
<integer>100000</integer>
</dict>
</plist>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment