datatalking/read_apple_notes.ipynb

## read_apple_notes.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An investigation: how to read Apple Notes locally, with Python\n",
    "\n",
    "I don't know if Apple has an API for Apple Notes.  I've definitely seen people [access notes with AppleScript](https://github.com/cfenollosa/NotesAppExport), but who wants to use AppleScript?  I'd rather use a mainstream, full-featured programming language.  \n",
    "\n",
    "Fortunately, [Apple stores this data in an ordinary SQLite database on the hard drive](https://apple.stackexchange.com/questions/111633/where-do-my-notes-written-in-the-notes-application-on-my-mac-get-saved).  Specifically, it's in `~/Library/Group Containers/group.com.apple.notes/NoteStore.sqlite`, though I'd recommend copying that file elsewhere before messing with it. \n",
    "\n",
    "So let's see about reading notes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sqlite3, gzip, unicodedata, re, sys\n",
    "# we're going to need some of these other libraries later\n",
    "\n",
    "DBNAME = \"NoteStore.sqlite\"\n",
    "conn = sqlite3.connect(DBNAME)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sensible place to start is by investigating the database schema.  What tables do we have?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('ZICCLOUDSTATE',), ('ZICCLOUDSYNCINGOBJECT',), ('ZICLOCATION',), ('ZICNOTEDATA',), ('ZICSEARCHINDEXTRANSACTION',), ('ZICSERVERCHANGETOKEN',), ('ZNEXTID',), ('Z_PRIMARYKEY',), ('Z_METADATA',), ('Z_MODELCACHE',), ('ACHANGE',), ('ATRANSACTION',)]\n"
     ]
    }
   ],
   "source": [
    "cursor = conn.cursor()\n",
    "cursor.execute(\"SELECT name FROM sqlite_master WHERE type='table';\")\n",
    "print(cursor.fetchall())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"ZICNOTEDATA\" sounds like a good start... what columns does it have?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 'Z_PK', 'INTEGER', 0, None, 1), (1, 'Z_ENT', 'INTEGER', 0, None, 0), (2, 'Z_OPT', 'INTEGER', 0, None, 0), (3, 'ZNOTE', 'INTEGER', 0, None, 0), (4, 'ZCRYPTOINITIALIZATIONVECTOR', 'BLOB', 0, None, 0), (5, 'ZCRYPTOTAG', 'BLOB', 0, None, 0), (6, 'ZDATA', 'BLOB', 0, None, 0)]\n"
     ]
    }
   ],
   "source": [
    "cursor = conn.cursor()\n",
    "cursor.execute(\"PRAGMA table_info(ZICNOTEDATA);\")\n",
    "print(cursor.fetchall())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"ZDATA\" looks like the only plausible option here.  So let's look at the most recent couple of notes and see if that's right..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(5140, 14, 7, 7174, None, None, b'\\x1f\\x8b\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x13M\\xd0\\xc1J\\x031\\x10\\x06\\xe0MR\\xb7\\xe9\\xec\\x8aC\\x14\\x95\\x88\\x18\\xbc\\xb8\\xf6\\xaa\\x07{\\x12\\xc4\\xe2\\x13\\xf8\\x00\\xdb\\xb2\\x8b-e\\xb7\\xc4\\xa5{\\xf1\\xd4\\xe7\\x10\\xc4\\xab\\xf8\\x1a\\xbe\\x92\\x17\\x0f:\\x9b\\x0e\\xc5\\\\\\x12>\\xfe\\xfcC\\xa2#\\xb3V:\\xc2\\xc8\\xfeH3\\x9a\\x17e\\xe9\\x16\\xb5\\x87|\\xbe|v\\xbeX\\xcd\\x8a\\x16\\xaa\\xc9\\xb4\\xaeV\\x85o`V5\\xbeve\\xed\\x1d\\xc9\"o\\xdd\\xf4)_6\\x85\\x07\\xb0\\x08\\xbd\\xae\\x85z\\xc2\\x9e\\x89 \\x82\\x04l\\xd83iM\\x10@\\x19D;\\x91)\\xb6\\x14\\xe3M\\x8a\\xac\\xc7f\\xfe\\xe5v\\xb8\\xed\\x00\\xcf\\xb8-\\xe6\\xd45&A\\x80R}\\xb6+\\x14\\xc1\\x06d\\x9a\\xed\\x16\\xd5v\\xc2\\x80\\xed\\x8e\\'tw\\x81\\xed\\x9e\\xefv\\xb9\\x84\\xa7\\x8e\\x11yjj\\xf7\\x83<\\xe2\\xfb\\xa6.\\xa1\\xd8\\xae=\\x04M\\x8f\\xfe\\xa5\\xb5G\\x1f\\xb0=\\x9f\\x9f\\xc2\\t\\xa0_\\x7f\\xbc~^<\\x1c\\xbd\\xbd\\xb4\\x13\\xfc\\x1a_\\x1a\\xa5\\xbf\\x95\\x91:\\x1d\\xc6\\xfa\\xc6\\xc8c1\\x94Z\\xfc\\x01\\x8b\\x8a\\xa4;\\x87\\x01\\x00\\x00'), (5139, 14, 2, 7173, None, None, b'\\x1f\\x8b\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x13}R\\xb1n\\xd4@\\x10\\xbd\\x0b\\x17a\\xb9\\xe1\\xb8\\x02\\x90#\\xa4\\x11\\xcd\\xf9\\xd0\\xc5\\xe6BG\\x97\"\\x844P\\xa4B\\x11\\xc5\\xaewl/\\xe7\\xdd\\xb5v\\xd7gE\\xa2\\xca\\x1f\\xa4\\xa3@\\xa2\\xbe(\\xbf\\xc1w\\xc1\\xac\\t9\\n\\x84\\x1b\\xcf\\xee\\x9by3o\\xdeF\\xa3\\xd9\\xd7\\xfdh4\\x1d%\\xd7\\xfb\\xb3\\xed\\xe4\\x0cj\\xb6A8Q\\xacp \\xb5\\xf3\\xaciP\\xc0F2\\xe0\\x16{H/0@\\x87m\\xd39(\\x8dU]\\xc3>\\xa5\\xb5\\xf7\\xad{\\x93\\xe7\\x95\\xf4u\\xc7\\xb3\\xc2\\xa8\\\\\\xac\\x8eJk\\x9cG\\x91\\xd7Fa(>\\xdc\\x95\\xe6\\xbc1<W\\x8cp\\x9b\\xbf\\xfdM\\x93\\xef\\xe0\\xcc\\xf2\\xc5\"\\x03\\xf8\\x88!A\\xb0\\xcb%\\x9c\\xfd5\\x8d\\xaf\\x11>\\x9c\\xc3\\xeaU\\xb6z\\x9d\\x1d\\x81\\xeb\\xda\\xb6A\\x85\\x9a`\\xe8Z\\xc1<B\\x1ar\\xce[,\\xbcEP\\xd2\\xcb\\x8ayi4\\x18\\x8d\\x818\\x8e\\xdf\\x9b~y\\'\\xb37z\\xee\\xc1v\\x9a\\x80c- 
4\\xa9P4\\xa1\\xeb\\\\Q\\xb5sRW\\xe0H\\x86\\xafCD\\xc2\\x144\\x92\\x7fn\\xb1\\n\\xd9`J`4\\x86\\x10\\xa8\\x07\\xf2cm\\xa8\\xbf\\x85\\xd6P)\\x97\\x8d\\xf4\\xc4E[A\\xf0\\x06\\x02\\xc7:\\x94H\\xbf\\x04\\xe9\\xe8\\xcc<\\xc9\\x0b\\xf4\\xb5\\xe9i\\xcdf\\x8d\\x84\\x11ca:\\x12\\x06=\\xe2\\xda\\x01\\xab\\x0c\\xf0\\xcb?[\\x08c0\\x10\\xb2,\\xd1\\x92p\\xd8\\xa0u\\x83\\xbe\\xf2~\\xb0\\xe0Y\\xd8\\xc2\\xc5p\\xfa\\x9fW\\xef\\xee\\x0c\\xda9U\\x18\\x8b\\xff\\xf4(p\\x05w\\xb2d\\x1aO\\xc2\\xb3\\xa1\\x873\\xfc\\xd3q\\xf2\\x98n\\xc6t\\xb3\\x9d$C\\x90\\xee%O\\xe2\\x88\\xc0\\x9f\\xf4=\\xa2\\xc4\\xfb\\xf8\\xc5\\xf3\\xf8 \\x9e\\xda\\xab\\x9bo\\xb7\\xf3\\xd3\\xa7\\xdf\\xbf\\xf4|\\xfa\\xe3d1{\\x10m\\'\\xb3\\xbdh\\xfc\\xf2\\xe1\\x10<\\x1b\\xff\\x02wm\\xc5T\\x9d\\x02\\x00\\x00')]\n"
     ]
    }
   ],
   "source": [
    "cursor = conn.cursor()\n",
    "cursor.execute(\"SELECT * FROM ZICNOTEDATA ORDER BY Z_PK DESC LIMIT 2;\")\n",
    "print(cursor.fetchall())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, if there's note data, it's clearly in that binary blob, as we expected.  But what is that blob?  There's [a Unix utility called `file`](https://linux.die.net/man/1/file) that can help us find out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cursor = conn.cursor()\n",
    "cursor.execute(\"SELECT * FROM ZICNOTEDATA ORDER BY Z_PK DESC LIMIT 1;\")\n",
    "byteblob = cursor.fetchall()[0][6]\n",
    "with open(\"randoblob\", \"wb\") as rb:\n",
    "    rb.write(byteblob)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "randoblob: gzip compressed data\r\n"
     ]
    }
   ],
   "source": [
    "!file randoblob"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's nice and easy to work with.  Let's see what the decompressed data looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "b'\\x08\\x00\\x12\\x82\\x03\\x08\\x00\\x10\\x00\\x1a\\xfb\\x02\\x129jeff lor\\najps review\\nnbconvert\\nintro for conlaw chapter\\n\\n\\x1a\\x10\\n\\x04\\x08\\x00\\x10\\x00\\x10\\x00\\x1a\\x04\\x08\\x00\\x10\\x00(\\x01\\x1a\\x10\\n\\x04\\x08\\x01\\x10\\x00\\x10\\n\\x1a\\x04\\x08\\x01\\x10\\x00(\\x02\\x1a\\x12\\n\\x04\\x08\\x01\\x10\\n\\x10\\x02\\x1a\\x04\\x08\\x01\\x10\\x08 \\x01(\\x03\\x1a\\x12\\n\\x04\\x08\\x01\\x10\\x0c\\x10\\x06\\x1a\\x04\\x08\\x01\\x10\\x00 \\x01(\\x04\\x1a\\x12\\n\\x04\\x08\\x01\\x10\\x12\\x10\\x02\\x1a\\x04\\x08\\x01\\x10\\x08 \\x01(\\x05\\x1a\\x10\\n\\x04\\x08\\x01\\x10\\x14\\x10\\x1f\\x1a\\x04\\x08\\x01\\x10\\x00(\\x06\\x1a\\x12\\n\\x04\\x08\\x01\\x104\\x10\\x0b\\x1a\\x04\\x08\\x01\\x10\\n \\x01(\\x07\\x1a\\x12\\n\\x04\\x08\\x01\\x103\\x10\\x01\\x1a\\x04\\x08\\x01\\x10\\t \\x01(\\x08\\x1a\\x12\\n\\x04\\x08\\x01\\x10?\\x10\\x03\\x1a\\x04\\x08\\x01\\x10\\x00 \\x01(\\t\\x1a\\x12\\n\\x04\\x08\\x01\\x10B\\x10\\x02\\x1a\\x04\\x08\\x01\\x10\\n \\x01(\\n\\x1a\\x12\\n\\x04\\x08\\x01\\x10D\\x10\\x01\\x1a\\x04\\x08\\x01\\x10\\x00 \\x01(\\x0b\\x1a\\x10\\n\\x04\\x08\\x01\\x10E\\x10\\x10\\x1a\\x04\\x08\\x01\\x10\\x00(\\x0c\\x1a\\x13\\n\\x04\\x08\\x01\\x10U\\x10\\xa1\\x03\\x1a\\x04\\x08\\x01\\x10\\x0b \\x01(\\r\\x1a\\x16\\n\\x08\\x08\\x00\\x10\\xff\\xff\\xff\\xff\\x0f\\x10\\x00\\x1a\\x08\\x08\\x00\\x10\\xff\\xff\\xff\\xff\\x0f\"\\x1d\\n\\x1b\\n\\x10r\\x82\\xaa\\x9c\\xac\\'G\\x17\\x9f|wb\\x10\\xc5E)\\x12\\x03\\x08\\xf6\\x03\\x12\\x02\\x08\\x0c*\\x06\\x088\\x12\\x02\\x18\\x01*\\x02\\x08\\x01'\n"
     ]
    }
   ],
   "source": [
    "decompressed = gzip.decompress(byteblob)\n",
    "print(decompressed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So I know, because I already looked at a few other notes, that this is a familiar pattern: there are about 13 or 14 nonsense bytes (which, sadly, are not identical) at the head of each note, and then several hundred nonsense bytes at the end, with the text in the middle.  Worse, there doesn't seem to be a consistent number of nonsense bytes, especially at the end.  \n",
    "\n",
    "My first-pass strategy is to chop out the unprintable bytes.  Unfortunately, we can't just use [string.printable](https://docs.python.org/3.6/library/string.html#string.printable), because it'll chop out all non-ASCII characters, when what I really want to chop out is anything that isn't valid UTF-8.  Apple Notes appears to store UTF-8 text, including things like \"smart quotes\" that I'd like to keep, and for non-English speakers there are obviously many more characters that need to be kept alive.\n",
    "\n",
    "So, as an alternate strategy, let's try to decode the bytes as UTF-8, ignoring anything that won't decode, and then replace with a space anything that isn't in a reasonable subset of the [Unicode general categories](https://www.unicode.org/reports/tr44/#General_Category_Values): letters, numbers, punctuation, symbols, and separators."
   ]
  },
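  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Unicode general categories to keep: letters, numbers, punctuation, symbols, separators\n",
    "KEEP_CATEGORIES = {\"L\", \"N\", \"P\", \"S\", \"Z\"}\n",
    "\n",
    "def clean_string_1(decompressed):\n",
    "    # decode as UTF-8, silently dropping any bytes that aren't valid UTF-8\n",
    "    decoded = decompressed.decode('utf-8', 'ignore')\n",
    "    # swap every character outside the kept categories (control characters, mostly) for a space;\n",
    "    # checking only the characters present avoids building a table over all of Unicode on each call\n",
    "    return \"\".join(c if unicodedata.category(c)[0] in KEEP_CATEGORIES else \" \" for c in decoded)"
   ]
  },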
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           9jeff lor ajps review nbconvert intro for conlaw chapter                  (                 (                   (                   (                   (                 (        4          (        3          (        ?          (        B          (        D          (        E        (        U          (                 \"     r'G |wb E)        *  8    *   \n"
     ]
    }
   ],
   "source": [
    "print(clean_string_1(decompressed))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That already looks pretty good.  It gets rid of most of the garbage.  Let's hold onto the progress thus far in a fetch function that returns notes as clean as we can get them so far, and see what we can do with the rest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def fetch_notes(connection, numnotes):\n",
    "    # fetch the ZDATA blob of the numnotes rows with the highest primary keys\n",
    "    cursor = connection.cursor()\n",
    "    cursor.execute(\"SELECT ZDATA FROM ZICNOTEDATA ORDER BY Z_PK DESC LIMIT ?;\", (numnotes,))\n",
    "    byteblobs = [row[0] for row in cursor.fetchall()]\n",
    "    # each blob is gzip-compressed; decompress, then strip the unprintable characters\n",
    "    decompressed = [gzip.decompress(x) for x in byteblobs]\n",
    "    cleaned = [clean_string_1(x) for x in decompressed]\n",
    "    return cleaned"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "several = fetch_notes(conn, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['           9jeff lor ajps review nbconvert intro for conlaw chapter                  (                 (                   (                   (                   (                 (        4          (        3          (        ?          (        B          (        D          (        E        (        U          (                 \"     r\\'G |wb E)        *  8    *   ',\n",
       " '            I have Emacs installed via brew ([emacs-plus formula](https://github.com/d12frosted/homebrew-emacs-plus/blob/master/Formula/emacs-plus.rb)).  Yesterday, I installed the OS 10.13.2 supplemental update (the Spectre mitigation one).    Now, Emacs won\\'t run.  And allegedly, I\\'m missing something from libjpeg all of a sudden.    Another possibility, come to think of it, is that I somehow broke it a couple weeks ago by installing a different version of libjpeg via the [jpeg formula](https://github.com/Homebrew/homebrew-core/blob/master/Formula/jpeg.rb).                (                 (                 \"     r\\'G |wb E)        *       ',\n",
       " '           wday one logistical notes  intros go over syllabus  make sure I let them know that supplemental material may be revised                 (                 (                 (                 (        $ I      (        u        (        m          (        !          (        #        (                 \":     !  x F  ߉HSE            r\\'G |wb E)        *       *   *  ]   e  ']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "several"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, it looks like there's a pattern here at least---the nonsense characters at the end consistently (over my n of 3, but I've confirmed this with a slightly larger n that includes notes with private information) start with a bunch of spaces, an open paren, a bunch of spaces, another open paren, and another bunch of spaces.  That's easy enough to take out. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def clean_string_2(partially_cleaned):\n",
    "    return re.split(r'\\s+\\(\\s+\\(\\s+', partially_cleaned)[0].strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['9jeff lor ajps review nbconvert intro for conlaw chapter', \"I have Emacs installed via brew ([emacs-plus formula](https://github.com/d12frosted/homebrew-emacs-plus/blob/master/Formula/emacs-plus.rb)).  Yesterday, I installed the OS 10.13.2 supplemental update (the Spectre mitigation one).    Now, Emacs won't run.  And allegedly, I'm missing something from libjpeg all of a sudden.    Another possibility, come to think of it, is that I somehow broke it a couple weeks ago by installing a different version of libjpeg via the [jpeg formula](https://github.com/Homebrew/homebrew-core/blob/master/Formula/jpeg.rb).\", 'wday one logistical notes  intros go over syllabus  make sure I let them know that supplemental material may be revised']\n"
     ]
    }
   ],
   "source": [
    "print([clean_string_2(x) for x in several])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And this is about as far as I can get.  I'm a little annoyed that some of these notes seem to have extra characters at the start (in notes 0 and 2 of the example, the 9 and the w, respectively, don't appear in the original notes).  This approach will also truncate any note whose own text contains that space-paren-space-paren-space pattern.  And heaven only knows what happens to notes with binary attachments.  \n",
    "\n",
    "I'd like to create some dummy notes to experiment with these, but it looks like the database doesn't update promptly (or doesn't get the primary keys in a sensible order) when notes are added through the UI---it might be periodically flushing from memory or from iCloud or something silly like that.  But for now, at least I feel like I can extract the text of text-only notes with moderate confidence, by brute force."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
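
A note on the stray `9` and `w` mentioned in the notebook's last cell: they may be length prefixes rather than noise. In the decompressed bytes printed in the notebook, the text is preceded by `\x12` and then `9` (byte 0x39, i.e. 57), and the note text that follows is exactly 57 bytes long, which is how protobuf-style framing marks a length-delimited field. The sketch below assumes that framing, which the notebook itself does not establish, and reads the field path 2, then 3, then 2 off that single sample, so it may not hold for other notes or other versions of Notes. `read_varint`, `get_field`, and `extract_text` are hypothetical helper names, and `sample` is a cut-down blob rebuilt from the first bytes of the printed output, with the trailing bytes omitted and the outer lengths recomputed.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte of the varint
            return value, pos
        shift += 7


def get_field(buf, wanted):
    """Return the payload of the first length-delimited field numbered `wanted`, or None."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # plain varint value: skip over it
            _, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: a length, then that many payload bytes
            length, pos = read_varint(buf, pos)
            if field == wanted:
                return buf[pos:pos + length]
            pos += length
        else:
            return None  # wire type this sketch does not handle
    return None


def extract_text(decompressed, path=(2, 3, 2)):
    """Follow nested fields along `path` (an assumption read off one sample) and decode the text."""
    payload = decompressed
    for field in path:
        payload = get_field(payload, field)
        if payload is None:
            return None
    return payload.decode("utf-8", "replace")


# Cut-down sample: same leading bytes as the notebook output, trailing bytes omitted,
# outer lengths recomputed so the framing stays self-consistent.
text = b"jeff lor\najps review\nnbconvert\nintro for conlaw chapter\n\n"
inner = b"\x12" + bytes([len(text)]) + text  # the length byte is 0x39, which prints as '9'
note = b"\x08\x00\x10\x00\x1a" + bytes([len(inner)]) + inner
sample = b"\x08\x00\x12" + bytes([len(note)]) + note

print(repr(extract_text(sample)))
```

On a real row this would be called as `extract_text(gzip.decompress(byteblob))`. If the assumption holds, it would return the text without the leading length character and without needing the space-paren-space regex, but that is untested here.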