Created October 22, 2016 20:01 · Gist: nickstanisha/c419b70f2201378ade98f92aaca9d299
A notebook that finds the longest possible Scrabble word, built one letter at a time from smaller valid Scrabble words.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scrabble Mini Riddler\n",
"[Link to original](http://fivethirtyeight.com/features/this-challenge-will-boggle-your-mind/)\n",
">What is the longest word you can build in a game of Scrabble one letter at a time? That is, starting with a valid two-letter word, how long a word can you build by playing one letter at a time on either side to form a valid three-letter word, then a valid four-letter word, and so on? (For example, HE could become THE, then THEM, then THEME, then THEMES, for a six-letter result.)\n",
"\n",
"We're going to approach this as a graph search problem, where each of our nodes is a unique word connected via directed edges to words reachable from it (e.g. he -> the). First, we'll need a list of english words, which I've downloaded from the TWL06 dictionary [here](https://www.wordgamedictionary.com/twl06/download/twl06.txt)" | |
]
},
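{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't already have the word list saved locally, one way to fetch it (assuming the download URL above is still live) is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Download the TWL06 word list to the filename used in the next cell\n",
"import urllib.request\n",
"urllib.request.urlretrieve('https://www.wordgamedictionary.com/twl06/download/twl06.txt',\n",
"                           'scrabble_word_list.txt')"
]
},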
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import networkx as nx"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 178691 words in our vocabulary.\n"
]
}
],
"source": [
"words, starters = set(), set()\n",
"with open('scrabble_word_list.txt', 'r') as infile:\n",
"    for line in infile:\n",
"        word = line.strip()\n",
"        if not word.isalpha():\n",
"            continue\n",
"\n",
"        if len(word) == 2:\n",
"            starters.add(word)\n",
"        words.add(word)\n",
"print(\"There are {} words in our vocabulary.\".format(len(words)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a directed graph frees us from the burden of keeping around a list of \"visited\" nodes commonly seen in BFS and DFS implementations. This is because our word search graph is a [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). This graph construction completes quickly (~900 ms), even with a vocabulary size greater than 300,000." | |
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"alphabet = 'abcdefghijklmnopqrstuvwyz'\n", | |
"g = nx.DiGraph()\n", | |
"\n", | |
"stack = [(None, s) for s in starters] # Parent, child pairs\n", | |
"while stack:\n", | |
" parent, child = stack.pop()\n", | |
" if parent is None:\n", | |
" g.add_node(child)\n", | |
" else:\n", | |
" g.add_edge(parent, child)\n", | |
" neighbors = [(child, c + child) for c in alphabet if c + child in words]\n", | |
" neighbors += [(child, child + c) for c in alphabet if child + c in words]\n", | |
" stack.extend(neighbors)" | |
] | |
}, | |
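{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since every edge connects a word to a word exactly one letter longer, the graph can never contain a cycle; networkx can confirm this directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sanity check: the word graph should be a DAG (every edge increases word length by one)\n",
"nx.is_directed_acyclic_graph(g)"
]
},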
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Out of 178691 words, 8331 are reachable beginning from a valid two-letter word.\n"
]
}
],
"source": [
"print(\"Out of {} words, {} are reachable beginning from a valid two-letter word.\".format(len(words), len(g)))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def find_path_to(word):\n",
"    \"\"\" DFS terminating at `word`. Returns a single path to `word` \"\"\"\n",
"    if word not in g:\n",
"        return None\n",
"    stack = [(s, []) for s in starters]\n",
"    while stack:\n",
"        cur, path = stack.pop()\n",
"        if cur == word:\n",
"            return path + [cur]\n",
"        new_path = path + [cur]\n",
"        neighbors = [(c + cur, new_path) for c in alphabet if c + cur in words]\n",
"        neighbors += [(cur + c, new_path) for c in alphabet if cur + c in words]\n",
"        stack.extend(neighbors)\n",
"    return None\n",
"\n",
"def longest_words_from(word):\n",
"    \"\"\" DFS starting from `word`, returns a list of the longest reachable words \"\"\"\n",
"    if word not in g:\n",
"        return None\n",
"    longest_words = list()\n",
"    stack = [word]\n",
"    while stack:\n",
"        cur = stack.pop()\n",
"        neighbors = [c + cur for c in alphabet if c + cur in words]\n",
"        neighbors += [cur + c for c in alphabet if cur + c in words]\n",
"        stack.extend(neighbors)\n",
"        if not neighbors:\n",
"            if not longest_words or len(cur) > len(longest_words[0]):\n",
"                longest_words = [cur]\n",
"            elif len(cur) == len(longest_words[0]) and cur not in longest_words:\n",
"                longest_words.append(cur)\n",
"    return longest_words"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['scrapings']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"longest_words_from('in')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['relapsers', 'glassiest', 'sheathers', 'classists', 'scrapings', 'upraisers', 'classisms']\n"
]
}
],
"source": [
"longest_words = ['a']\n",
"for word in g:\n",
"    if len(word) > len(longest_words[0]):\n",
"        longest_words = [word]\n",
"    elif len(word) == len(longest_words[0]):\n",
"        longest_words.append(word)\n",
"print(longest_words)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['la', 'lap', 'laps', 'lapse', 'elapse', 'relapse', 'relapser', 'relapsers'], ['la', 'las', 'lass', 'lassi', 'lassie', 'lassies', 'glassies', 'glassiest'], ['at', 'eat', 'eath', 'heath', 'sheath', 'sheathe', 'sheather', 'sheathers'], ['la', 'las', 'lass', 'lassi', 'lassis', 'classis', 'classist', 'classists'], ['in', 'pin', 'ping', 'aping', 'raping', 'craping', 'scraping', 'scrapings'], ['is', 'ais', 'rais', 'raise', 'raiser', 'raisers', 'praisers', 'upraisers'], ['la', 'las', 'lass', 'lassi', 'lassis', 'classis', 'classism', 'classisms']]\n"
]
}
],
"source": [
"print([find_path_to(w) for w in longest_words])"
]
},
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"find_path_to('pastorales')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"#nx.write_graphml(g, 'word_search.graphml')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"adj_list = dict()\n", | |
"for node in g:\n", | |
" adj_list[node] = sorted(g.neighbors(node))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import json\n", | |
"with open('word_graph.json', 'w') as outfile:\n", | |
" json.dump(adj_list, outfile)" | |
] | |
} | |
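,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The exported adjacency list can be loaded back into a graph later. A minimal sketch (assuming `word_graph.json` was written by the cell above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Rebuild the word graph from the saved adjacency list\n",
"with open('word_graph.json', 'r') as infile:\n",
"    adj_list = json.load(infile)\n",
"\n",
"g2 = nx.DiGraph()\n",
"g2.add_nodes_from(adj_list)  # keeps words that have no outgoing edges\n",
"for word, neighbors in adj_list.items():\n",
"    g2.add_edges_from((word, n) for n in neighbors)"
]
}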
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}