damontallen/Spellchecker.ipynb

## mywords.txt
damontallen
px
UF
jpg
img
iframe
youtube
google
src
edu
http
ufl
imgur
github
preload
href
wikipedia
wikimedia
autoplay
org
url
https
Github
mailto
gmail
nbviewer
IPython
www
frameborder
helpdesk
allowfullscreen
spacebar
frac
ksi
kip
wiki
png
svg
pdf
YouTube
circ
html
precast
internet
LEED
MSDS

## Spellchecker.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook Spellchecker\n",
    "\n",
    "This is not a finished product but rather a proof of concept.  This notebook quickly spell checks the markdown cells of a completed notebook.  It does not provide a way to edit or correct the errors (yet), but if the notebook is opened in gedit with the option to highlight misspelled words turned on, the errors can be searched for and gedit will suggest corrections.\n",
    "\n",
    "It works by loading a saved notebook file and parsing its json structure to find the markdown cells.  The spell checking is done with the [enchant](https://pythonhosted.org/pyenchant/tutorial.html) Python library.  One of the virtues of the enchant library is that dictionaries for other languages are available.  It also allows exceptions to be added to the loaded dictionary from a simple text file.\n",
    "\n",
    "Since I use html tags and markdown links in my notebooks, I use some regular expressions to remove the urls and so reduce false positives.\n",
    "\n",
    "The json module ships with the Python standard library.  If you do not have enchant installed, install it with pip (the package name is pyenchant).\n",
    "\n",
    "    pip install pyenchant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4.0.0\n"
     ]
    }
   ],
   "source": [
    "from IPython import __version__ as IPython_version\n",
    "print(IPython_version)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import enchant\n",
    "from enchant.checker import SpellChecker\n",
    "from enchant.tokenize import EmailFilter, URLFilter\n",
    "\n",
    "# Checker for US English that skips e-mail addresses and urls\n",
    "chkr = SpellChecker(\"en_US\", filters=[EmailFilter, URLFilter])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test Misspelled Words\n",
    "\n",
    "falll uup th rammp jklsd \n",
    "\n",
    "### Test HTML img tag\n",
    "\n",
    "<a href = \"http://commons.wikimedia.org/wiki/File:HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\"><img src=\"http://upload.wikimedia.org/wikipedia/commons/thumb/2/21/HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG/800px-HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\" alt = \"Construction Material image\" Title=\"Construction Material\" style=\"max-width:200px; max-height:200px; border:1px solid blue; float:left; margin-right:10px;\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Dictionary\n",
    "\n",
    "The file \"**mywords.txt**\" is a list of exceptions (or allowed words) that are added to the dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load the en_US dictionary together with the exception list in mywords.txt\n",
    "lvl2 = enchant.DictWithPWL(\"en_US\", \"mywords.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create HTML and Markdown Regular Expression Searches"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create masks to remove urls from the spell checked text\n",
    "import re\n",
    "# Captures the text after ']' up to the next ')': a markdown link url, including its opening '('\n",
    "md_links = re.compile(\"]([^)]+)\")\n",
    "html = re.compile(\"<[^>]+>\")  # Matches every html tag\n",
    "# Create masks to add back the Title and alt text of html img tags\n",
    "alt = re.compile(r'alt\\s*=\\s*\"[^\"]+\"')  # Grab the alt text\n",
    "title = re.compile(r'Title\\s*=\\s*\"[^\"]+\"')  # Grab the Title text (capital T only)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enter Notebook Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = './'\n",
    "file = 'Spellchecker.ipynb'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Notebook and Check Spelling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In markdown cell -- #1 --, starting with: # Notebook Spellchecker\n",
      "\n",
      "This is not a finished product but rather a proof of co...\n",
      "\n",
      "\tERRORS:\n",
      "\tjson, urls, gedit, Spellchecker\n",
      "\n",
      "\n",
      "In markdown cell -- #2 --, starting with: ### Test Misspelled Words\n",
      "\n",
      "falll uup th rammp jklsd \n",
      "\n",
      "### Test HTML img tag\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "...\n",
      "\n",
      "\tERRORS:\n",
      "\tfalll, rammp, jklsd, uup\n",
      "\n",
      "\n",
      "In markdown cell -- #3 --, starting with: ### Load Dictionary\n",
      "\n",
      "The file \"**mywords.txt**\" is a list of exceptions (or allo...\n",
      "\n",
      "\tERRORS:\n",
      "\ttxt, mywords\n",
      "\n",
      "\n",
      "In markdown cell -- #7 --, starting with: List the removed text that contains urls....\n",
      "\n",
      "\tERRORS:\n",
      "\turls\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Load the notebook file and parse its json structure\n",
    "with open(path + file, 'r') as f:\n",
    "    txt = f.read()\n",
    "p = json.loads(txt)\n",
    "count = 0      # running number of markdown cells\n",
    "removed = []   # urls and html tags stripped before checking\n",
    "replaced = []  # alt and Title text added back for checking\n",
    "for cell in p['cells']:\n",
    "    if cell['cell_type'] == 'markdown':\n",
    "        count += 1\n",
    "        txt2 = ''.join(cell['source'])\n",
    "        txt3 = txt2\n",
    "        # Strip markdown link urls and html tags from the text to check\n",
    "        for lin in list(md_links.findall(txt2)) + list(html.findall(txt2)):\n",
    "            if len(lin) < 1:\n",
    "                continue\n",
    "            removed.append(lin)\n",
    "            txt3 = txt3.replace(lin, '\\n')\n",
    "        # Add the img alt and Title text back so it still gets checked\n",
    "        for al in alt.findall(txt2):\n",
    "            A = al.split('=')[1]\n",
    "            replaced.append(A)\n",
    "            txt3 += '\\n' + A\n",
    "        for ti in title.findall(txt2):\n",
    "            T = ti.split('=')[1]\n",
    "            replaced.append(T)\n",
    "            txt3 += '\\n' + T\n",
    "        chkr.set_text(txt3)\n",
    "        found = False\n",
    "        words = []\n",
    "        for err in chkr:\n",
    "            word = err.word\n",
    "            # Report only words that are also missing from the exception list\n",
    "            if not lvl2.check(word):\n",
    "                words.append(word)\n",
    "                if not found:\n",
    "                    # Print the cell header once, with at most 80 characters of its text\n",
    "                    start_txt = txt3.strip()[:80]\n",
    "                    print('In markdown cell -- #%d --, starting with: %s...' % (count, start_txt))\n",
    "                    found = True\n",
    "        if len(words) > 0:\n",
    "            print('\\n\\tERRORS:')\n",
    "            print('\\t' + ', '.join(set(words)))\n",
    "            print('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List the removed text that contains urls."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['(https://pythonhosted.org/pyenchant/tutorial.html',\n",
       " '<a href = \"http://commons.wikimedia.org/wiki/File:HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\">',\n",
       " '<img src=\"http://upload.wikimedia.org/wikipedia/commons/thumb/2/21/HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG/800px-HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\" alt = \"Construction Material image\" Title=\"Construction Material\" style=\"max-width:200px; max-height:200px; border:1px solid blue; float:left; margin-right:10px;\"/>',\n",
       " '</a>']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "removed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List the img alt text and titles that were added back."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[' \"Construction Material image\"', '\"Construction Material\"']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "replaced"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}