@damontallen
Last active August 29, 2015 14:27
This is a spell-checking proof of concept for IPython notebooks
damontallen
px
UF
jpg
img
iframe
youtube
google
src
edu
http
ufl
imgur
github
preload
href
wikipedia
wikimedia
autoplay
org
url
https
Github
mailto
gmail
nbviewer
IPython
www
frameborder
helpdesk
allowfullscreen
spacebar
frac
ksi
kip
wiki
png
svg
pdf
YouTube
circ
html
precast
internet
LEED
MSDS
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Notebook Spellchecker\n",
"\n",
"This is not a finished product but rather a proof of concept. This notebook quickly spell-checks the markdown cells of a saved notebook. It does not (yet) provide a way to edit or correct the errors it finds, but if the notebook file is opened in gedit with misspelled-word highlighting turned on, each reported word can be searched for and gedit will suggest corrections.\n",
"\n",
"A saved notebook file is loaded, and its json structure is parsed to find the markdown cells. The spell checking is done with the [enchant](https://pythonhosted.org/pyenchant/tutorial.html) Python library. One virtue of enchant is that dictionaries for other languages are available. It also allows exceptions to be added to the loaded dictionary through a simple text file.\n",
"\n",
"Since I use html tags and markdown links in my notebooks, I use regular expressions to strip the urls and so reduce false positives.\n",
"\n",
"If you do not have enchant installed, install it with pip (the package is named pyenchant). The json library is part of the Python standard library and needs no installation.\n",
"\n",
"    pip install pyenchant"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4.0.0\n"
]
}
],
"source": [
"from IPython import __version__ as IPython_version\n",
"print(IPython_version)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import json\n",
"import enchant\n",
"from enchant.checker import SpellChecker\n",
"from enchant.tokenize import EmailFilter, URLFilter\n",
"from enchant.tokenize import get_tokenizer, HTMLChunker\n",
"chkr = SpellChecker(\"en_US\",filters=[EmailFilter,URLFilter])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test Misspelled Words\n",
"\n",
"falll uup th rammp jklsd \n",
"\n",
"### Test HTML img tag\n",
"\n",
"<a href = \"http://commons.wikimedia.org/wiki/File:HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\"><img src=\"http://upload.wikimedia.org/wikipedia/commons/thumb/2/21/HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG/800px-HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\" alt = \"Construction Material image\" Title=\"Construction Material\" style=\"max-width:200px; max-height:200px; border:1px solid blue; float:left; margin-right:10px;\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Dictionary\n",
"\n",
"The file \"**mywords.txt**\" is a list of exceptions (or allowed words) that are added to the dictionary."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Pull in updated exception list\n",
"lvl2 = enchant.DictWithPWL(\"en_US\",\"mywords.txt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create HTML and Markdown Regular Expression Searches"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#Create masks to remove urls from spell checked text\n",
"import re\n",
"md_links = re.compile(\"]([^)]+)\") # Captures the text after ']' up to the next ')'; for a markdown link this is '(' plus the url\n",
"html = re.compile(\"<[^>]+>\") # This gets all html tags\n",
"#Create masks to add back titles and alt text for html img tags\n",
"alt = re.compile('alt\\s*=\\s*\"[^\"]+\"') # Grab the alt text\n",
"title = re.compile('Title\\s*=\\s*\"[^\"]+\"') # Grab the Title text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enter Notebook Path"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"path ='./'\n",
"file = 'Spellchecker.ipynb'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load Notebook and Check Spelling"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In markdown cell -- #1 --, starting with: # Notebook Spellchecker\n",
"\n",
"This is not a finished product but rather a proof of co...\n",
"\n",
"\tERRORS:\n",
"\tjson, urls, gedit, Spellchecker\n",
"\n",
"\n",
"In markdown cell -- #2 --, starting with: ### Test Misspelled Words\n",
"\n",
"falll uup th rammp jklsd \n",
"\n",
"### Test HTML img tag\n",
"\n",
"\n",
"\n",
"\n",
"...\n",
"\n",
"\tERRORS:\n",
"\tfalll, rammp, jklsd, uup\n",
"\n",
"\n",
"In markdown cell -- #3 --, starting with: ### Load Dictionary\n",
"\n",
"The file \"**mywords.txt**\" is a list of exceptions (or allo...\n",
"\n",
"\tERRORS:\n",
"\ttxt, mywords\n",
"\n",
"\n",
"In markdown cell -- #7 --, starting with: List the removed text that contains urls....\n",
"\n",
"\tERRORS:\n",
"\turls\n",
"\n",
"\n"
]
}
],
"source": [
"with open(path+file,'r') as f:\n",
"    txt = f.read()\n",
"p = json.loads(txt)\n",
"count = 0\n",
"removed = []\n",
"replaced = []\n",
"for index, cell in enumerate(p['cells']):\n",
"    if cell['cell_type'] =='markdown':\n",
"        count += 1\n",
"        txt2 = ''.join(cell['source'])\n",
"        txt3 = txt2\n",
"        for lin in list(md_links.findall(txt2))+list(html.findall(txt2)):\n",
"            if len(lin)<1:\n",
"                continue\n",
"            removed.append(lin)\n",
"            txt3 = txt3.replace(lin,'\\n')\n",
"        for al in alt.findall(txt2):\n",
"            A = al.split('=')[1]\n",
"            replaced.append(A)\n",
"            txt3+='\\n'+A\n",
"        for ti in title.findall(txt2):\n",
"            T = ti.split('=')[1]\n",
"            replaced.append(T)\n",
"            txt3+='\\n'+T\n",
"        chkr.set_text(txt3)\n",
"        found = False\n",
"        words = []\n",
"        for err in chkr:\n",
"            word = err.word\n",
"            if not lvl2.check(word):\n",
"                words.append(word)\n",
"                if not found:\n",
"                    start_txt = txt3.strip()[:80] if len(txt3.strip())>80 else txt3.strip()\n",
"                    print('In markdown cell -- #%d --, starting with: %s...'%(count,start_txt))\n",
"                    found = True\n",
"        if len(words)>0:\n",
"            print('\\n\\tERRORS:')\n",
"            print('\\t'+', '.join(set(words)))\n",
"            print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"List the removed text that contains urls."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['(https://pythonhosted.org/pyenchant/tutorial.html',\n",
" '<a href = \"http://commons.wikimedia.org/wiki/File:HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\">',\n",
" '<img src=\"http://upload.wikimedia.org/wikipedia/commons/thumb/2/21/HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG/800px-HK_Central_Piers_construction_site_building_material_tiles_view_Wan_Chai.JPG\" alt = \"Construction Material image\" Title=\"Construction Material\" style=\"max-width:200px; max-height:200px; border:1px solid blue; float:left; margin-right:10px;\"/>',\n",
" '</a>']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"removed"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"List the img alt text and titles that were added back."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[' \"Construction Material image\"', '\"Construction Material\"']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"replaced"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
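
The text-preparation part of the notebook above (pulling the markdown cells out of the notebook json, then stripping link targets and html tags while keeping img alt and title text) can be sketched without enchant. This is a minimal sketch under stated assumptions, not the notebook's own code: the names `markdown_cells`, `strip_markup`, `MD_LINK_TARGET`, `HTML_TAG` and `ATTR_TEXT` are hypothetical, and the patterns differ from the notebook's. Here the link pattern matches the whole `](url)` target, and alt/title matching is case-insensitive.

```python
import json
import re

# Assumed patterns (not the notebook's own).
MD_LINK_TARGET = re.compile(r"\]\([^)]*\)")   # the "](url)" part of a markdown link
HTML_TAG = re.compile(r"<[^>]+>")             # any html tag
# Quoted value of an alt or title attribute, in any letter case.
ATTR_TEXT = re.compile(r'\b(?:alt|title)\s*=\s*"([^"]*)"', re.IGNORECASE)


def markdown_cells(notebook_text):
    """Yield the joined source of every markdown cell in a notebook json string."""
    notebook = json.loads(notebook_text)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "markdown":
            source = cell.get("source", "")
            # "source" may be stored as one string or as a list of strings
            yield source if isinstance(source, str) else "".join(source)


def strip_markup(text):
    """Remove link targets and html tags, then append any alt/title text."""
    kept = ATTR_TEXT.findall(text)          # collect before the tags are removed
    text = MD_LINK_TARGET.sub("]", text)    # "[label](url)" becomes "[label]"
    text = HTML_TAG.sub("\n", text)         # each tag becomes a line break
    return "\n".join([text] + kept)


if __name__ == "__main__":
    demo = json.dumps({
        "cells": [
            {"cell_type": "markdown", "metadata": {},
             "source": ["See [enchant](https://example.org/doc).\n",
                        '<img src="http://example.org/a.png" alt="A brick wall"/>']},
            {"cell_type": "code", "metadata": {}, "source": ["x = 1"]},
        ],
        "nbformat": 4, "nbformat_minor": 0,
    })
    for number, cell_text in enumerate(markdown_cells(demo), start=1):
        print(number, repr(strip_markup(cell_text)))
```

The cleaned text returned by `strip_markup` could then be handed to `chkr.set_text(...)` in place of `txt3`. Substituting with `re.sub` instead of looping over `findall` results avoids replacing unrelated occurrences of the same string elsewhere in the cell.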