christopherphan/best_wordle_words_remaining.ipynb

## best_wordle_words_remaining.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "40835e56-69ef-4cad-ad86-a0073d969ec1",
   "metadata": {},
   "source": [
    "# Best starting Wordle word: remaining possible words approach\n",
    "\n",
    "Christopher Phan <chrisphan.com>\n",
    "\n",
    "Last revised 2022-03-05\n",
    "\n",
    "\n",
    "[Wordle](https://www.nytimes.com/games/wordle/index.html) is a word puzzle game and massive social media sensation, now owned by the *New York Times*. Exactly one puzzle is given every day, revealed at midnight local time. The goal is to determine the solution word in six or fewer guesses. The solution is always a five-letter word. After each guess, the game indicates which letters in the guessed word (1) are not in the solution word, (2) are in the solution word, but in the wrong positions, or (3) are in the solution word in the same position.\n",
    "\n",
    "For example, if your guess was \"PIZZA\" and the solution word was \"FAZED\", the game would indicate with the following output:\n",
    "\n",
    "![](pizza.png)\n",
    "\n",
    "This indicates that the letters P, I, and the second Z are not in the solution (there's only one Z), the first Z is in the correct location, and the A is in the solution but not in that position.\n",
    "\n",
    "Note that the first guess is made without any information. My goal is to *find the best starting word*, by looking at how many words are ruled out, on average.\n",
    "\n",
    "## Imported modules\n",
    "\n",
    "In addition to items in the standard library, I will be importing `pandas` for data analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c1ecfadc-5daa-4a39-b6de-f46c1a393c65",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "from dataclasses import dataclass\n",
    "from collections.abc import Sequence\n",
    "from typing import Final, Optional\n",
    "from math import fsum\n",
    "from random import sample\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import scipy.stats as stats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "870d9097-3153-4470-a706-6b4eb1370369",
   "metadata": {},
   "source": [
    "## The word list\n",
    "\n",
    "In a previous version of this analysis, I used [Spell Checker Oriented Word Lists](http://wordlist.aspell.net) to make a list of 5-letter English words. \n",
    "\n",
    "Instead, this time, I am going to use the game's own list of acceptable words (extracted from the source code), found by combining these\n",
    "lists. **WARNING:** Looking at these will reveal answers. I recommend covering the screen with your hand to avoid spoiling the puzzle! \n",
    "* https://gist.github.com/cfreshman/40608e78e83eb4e1d60b285eb7e9732f (Acceptable words that aren't solutions.)\n",
    "* https://gist.github.com/cfreshman/a7b776506c73284511034e63af1017ee (Solutions in alphabetical order.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a8dae571-fcc5-4a03-aa65-a88f6ec71ff4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12947"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open(\"nyt_word_list.txt\", \"rt\") as infile:\n",
    "    word_list: list[str] = [k.upper() for k in infile.read().split(\"\\n\")]\n",
    "len(word_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a6dbc9a-5241-4ba9-bcac-7bc4e7751d6c",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Letter data\n",
    "\n",
    "Using a `Counter`, we produce a `pandas.DataFrame` with the following information about each word:\n",
    "\n",
    "1. How many of each letter is present\n",
    "2. Which letter is in each position"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "92452d04-4ad6-4e28-93d9-ede2998a722d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "      <th>E</th>\n",
       "      <th>F</th>\n",
       "      <th>G</th>\n",
       "      <th>H</th>\n",
       "      <th>I</th>\n",
       "      <th>J</th>\n",
       "      <th>...</th>\n",
       "      <th>V</th>\n",
       "      <th>W</th>\n",
       "      <th>X</th>\n",
       "      <th>Y</th>\n",
       "      <th>Z</th>\n",
       "      <th>pos0</th>\n",
       "      <th>pos1</th>\n",
       "      <th>pos2</th>\n",
       "      <th>pos3</th>\n",
       "      <th>pos4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>AAHED</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>A</td>\n",
       "      <td>H</td>\n",
       "      <td>E</td>\n",
       "      <td>D</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AALII</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>A</td>\n",
       "      <td>L</td>\n",
       "      <td>I</td>\n",
       "      <td>I</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AARGH</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>A</td>\n",
       "      <td>R</td>\n",
       "      <td>G</td>\n",
       "      <td>H</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>AARTI</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>A</td>\n",
       "      <td>R</td>\n",
       "      <td>T</td>\n",
       "      <td>I</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ABACA</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>A</td>\n",
       "      <td>B</td>\n",
       "      <td>A</td>\n",
       "      <td>C</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 31 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       A  B  C  D  E  F  G  H  I  J  ...  V  W  X  Y  Z  pos0  pos1  pos2  \\\n",
       "AAHED  2  0  0  1  1  0  0  1  0  0  ...  0  0  0  0  0     A     A     H   \n",
       "AALII  2  0  0  0  0  0  0  0  2  0  ...  0  0  0  0  0     A     A     L   \n",
       "AARGH  2  0  0  0  0  0  1  1  0  0  ...  0  0  0  0  0     A     A     R   \n",
       "AARTI  2  0  0  0  0  0  0  0  1  0  ...  0  0  0  0  0     A     A     R   \n",
       "ABACA  3  1  1  0  0  0  0  0  0  0  ...  0  0  0  0  0     A     B     A   \n",
       "\n",
       "       pos3  pos4  \n",
       "AAHED     E     D  \n",
       "AALII     I     I  \n",
       "AARGH     G     H  \n",
       "AARTI     T     I  \n",
       "ABACA     C     A  \n",
       "\n",
       "[5 rows x 31 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "LETTERS: Final[list[str]] = [chr(65 + k) for k in range(26)]\n",
    "\n",
    "def letter_counter(w: str) -> tuple[int, ...]:\n",
    "    c = Counter(w)\n",
    "    return tuple([c[u] for u in LETTERS])\n",
    "\n",
    "\n",
    "letter_data = pd.DataFrame.from_dict(\n",
    "    {word: letter_counter(word) for word in word_list}, orient=\"index\", columns=LETTERS\n",
    ")\n",
    "\n",
    "for pos in range(5):\n",
    "    letter_data[f\"pos{pos}\"] = letter_data.index.map(lambda x: x[pos])\n",
    "\n",
    "letter_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19e1b8aa-8524-4ca2-af64-4f282bffff88",
   "metadata": {},
   "source": [
    "## A filtering function\n",
    "\n",
    "Given a `solution` word and a `guess` word, we want to figure out which words are still possible. The game returns the following information:\n",
    "\n",
    "(1) It shows all the letters in the `guess` that are correct (same letter in the same position) in green. \n",
    "(2) If a letter in the `guess` is also in the `solution` but in a different position, it will be shown in yellow. However, the number of times the same letter in the `guess` is shaded green or yellow will not exceed the number of times that letter appears in the `solution`. For example, in the PIZZA example above, the second Z is shaded gray and not yellow because there is only one Z in the `solution`.\n",
    "(3) The remaining letters are shaded gray.\n",
    "\n",
    "We create a function to create a query that can be run on a subset of `letter_data`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "83e29070-e218-4007-b3c6-07f4dd4b9899",
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_query(solution: str, guess: str) -> str:\n",
    "    query_items = []\n",
    "    counted_letters = []\n",
    "    for idx, item in enumerate(zip(solution.upper(), guess.upper())):\n",
    "        if item[0] == item[1]:\n",
    "            query_items.append(f\"pos{idx} == {item[1]!r}\")\n",
    "        else:\n",
    "            query_items.append(f\"pos{idx} != {item[1]!r}\")\n",
    "        if item[1] not in counted_letters:\n",
    "            counted_letters.append(item[1])\n",
    "            if (count := guess.upper().count(item[1])) > (num := solution.upper().count(item[1])):\n",
    "                query_items.append(f\"{item[1]} <= {num}\")\n",
    "            elif count <= num:\n",
    "                query_items.append(f\"{item[1]} >= {count}\")\n",
    "    return \" and \".join(query_items)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65de6b71-519e-4cc2-acb5-d772d458a7d7",
   "metadata": {},
   "source": [
    "Here's an example of the information we get from guessing PIZZA when the solution is FAZED:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3cba8e15-ae6e-4b05-8734-ffd01c0ab4ed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"pos0 != 'P' and P <= 0 and pos1 != 'I' and I <= 0 and pos2 == 'Z' and Z <= 1 and pos3 != 'Z' and pos4 != 'A' and A >= 1\""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "make_query(\"fazed\", \"pizza\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a86bf311-1729-4028-9a7f-509fc1de188d",
   "metadata": {},
   "source": [
    "We'll make a function to produce a list of the remaining possible words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c762bc3f-7875-4d22-ad88-ea56321c006e",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['ADZED', 'ADZES', 'BAZAR', 'BAZOO', 'DAZED', 'DAZER', 'DAZES', 'FAZED',\n",
       "       'FAZES', 'GAZAL', 'GAZAR', 'GAZED', 'GAZER', 'GAZES', 'GAZON', 'GAZOO',\n",
       "       'HAZAN', 'HAZED', 'HAZEL', 'HAZER', 'HAZES', 'KAZOO', 'LAZAR', 'LAZED',\n",
       "       'LAZES', 'LAZOS', 'MAZED', 'MAZER', 'MAZES', 'MAZEY', 'MAZUT', 'MUZAK',\n",
       "       'NAZES', 'RAZED', 'RAZEE', 'RAZER', 'RAZES', 'RAZOO', 'RAZOR', 'SAZES',\n",
       "       'WAZOO'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def remaining_possible_words(solution: str, guess: str, test_list: Optional[Sequence[str]] = None) -> list[str]:\n",
    "    if test_list is None:\n",
    "        test_data = letter_data\n",
    "    else:\n",
    "        test_data = letter_data[letter_data.index.isin([x.upper() for x in test_list])]\n",
    "    return test_data.query(make_query(solution, guess)).index\n",
    "\n",
    "remaining_possible_words(\"fazed\", \"pizza\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb20939f-5134-43c3-9314-4cab992adf10",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Finding the best initial words\n",
    "\n",
    "One naïve way to proceed is to compute `remaining_possible_words` for each `solution, guess` pair and then find the mean for each given `guess`. However, there are over 12,000 words in the word list, so testing every `solution, guess` pair would mean running `remaining_possible_words` more than 16 million times. So, instead, we will only test against a *random sample* of solutions and a random sample of words to rule out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d4c9da3b-fbb9-4f66-bc8c-173171963305",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "solution_sample: list[str] = sample(word_list, 200)\n",
    "test_sample: list[str] = sample(word_list, 200)\n",
    "\n",
    "def score_word(guess, solution_list: Optional[Sequence[str]]=None, test_list: Optional[Sequence[str]]=None) -> float:\n",
    "    if solution_list is None:\n",
    "        solution_list = word_list\n",
    "    solutions = np.array([len(remaining_possible_words(solution, guess, test_list)) for solution in solution_list])\n",
    "    return np.mean(solutions)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c7f5fb7-d318-484b-918d-f69252a1e78b",
   "metadata": {},
   "source": [
    "For example, let's score PIZZA and AROSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bcece640-ef45-4d4d-8073-ed8b7c18a78c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "40.54\n",
      "6.53\n"
     ]
    }
   ],
   "source": [
    "for k in [\"PIZZA\", \"AROSE\"]:\n",
    "    print(score_word(k, solution_sample, test_sample))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3268863f-3074-4c16-823a-ecc16e3c7369",
   "metadata": {},
   "source": [
    "That means AROSE is a better initial guess than PIZZA (remember, lower is better).\n",
    "\n",
    "## Finale\n",
    "\n",
    "Now, we are ready to score all of the words and show the best-scoring words. (This cell took my late-2020 Mac M1 Mini over an hour to evaluate.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "7adf9110-2a22-42eb-a060-4b593a682107",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LARES    4.590\n",
       "LANES    4.665\n",
       "RALES    4.700\n",
       "ALOES    4.875\n",
       "TALES    4.925\n",
       "SALET    5.005\n",
       "ARLES    5.005\n",
       "LORES    5.025\n",
       "SALUE    5.025\n",
       "LAERS    5.045\n",
       "LEANS    5.085\n",
       "AEROS    5.115\n",
       "AEONS    5.150\n",
       "LEARS    5.170\n",
       "SOARE    5.185\n",
       "ROLES    5.195\n",
       "NARES    5.220\n",
       "EARLS    5.280\n",
       "TARES    5.280\n",
       "SOLER    5.325\n",
       "dtype: float64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_scores = pd.Series(\n",
    "    {word: score_word(word, solution_sample, test_sample) for word in word_list})\n",
    "\n",
    "word_scores.sort_values().head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8dc94069-07ca-40fb-a881-fe471f9ef6b8",
   "metadata": {},
   "source": [
    "Note that most of these words end in S. That makes sense, since many words (plural nouns and third-person present-tense verbs) end with S. Guessing with a word ending in S would immediately either rule these words out or confirm that the solution ends with S.\n",
    "\n",
    "So next time you play Wordle, perhaps start with LARES, LANES, RALES, ALOES, or another word from this list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc98e572-55f7-4a61-935c-5190368c364f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "data1",
   "language": "python",
   "name": "data1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

## pizza.png

      
    Raw
  

              pizza.png
	{
	"cells": [
	{
	"cell_type": "markdown",
	"id": "40835e56-69ef-4cad-ad86-a0073d969ec1",
	"metadata": {},
	"source": [
	"# Best starting Wordle word: remaining possible words approach\n",
	"\n",
	"Christopher Phan <chrisphan.com>\n",
	"\n",
	"Last revised 2022-03-05\n",
	"\n",
	"\n",
	"[Wordle](https://www.nytimes.com/games/wordle/index.html) is a word puzzle game and massive social media sensation, now owned by the New York Times. Exactly one puzzle is given every day, revealed at midnight local time. The goal is to determine the solution word in six or fewer guesses. The solution is always a five-letter word. After each guess, the game indicates which letters in the guessed word (1) are not in the solution word, (2) are in the solution word, but in the wrong positions, or (3) are in the solution word in the same position.\n",
	"\n",
	"For example, if your guess was \"PIZZA\" and the solution word was \"FAZED\", the game would indicate with the following output:\n",
	"\n",
	"![](pizza.png)\n",
	"\n",
	"This indicates that the letters P, I, and the second Z are not in the solution (there's only one Z), the first Z is in the correct location, and the A is in the solution but not in that position.\n",
	"\n",
	"Note that the first guess is made without any information. My goal is to find the best starting word, by looking at how many words are ruled out, on average.\n",
	"\n",
	"## Imported modules\n",
	"\n",
	"In addition to items in the standard library, I will be importing `pandas` for data analysis."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"id": "c1ecfadc-5daa-4a39-b6de-f46c1a393c65",
	"metadata": {},
	"outputs": [],
	"source": [
	"from collections import Counter\n",
	"from dataclasses import dataclass\n",
	"from collections.abc import Sequence\n",
	"from typing import Final, Optional\n",
	"from math import fsum\n",
	"from random import sample\n",
	"\n",
	"import pandas as pd\n",
	"import numpy as np\n",
	"import scipy.stats as stats"
	]
	},
	{
	"cell_type": "markdown",
	"id": "870d9097-3153-4470-a706-6b4eb1370369",
	"metadata": {},
	"source": [
	"## The word list\n",
	"\n",
	"In a previous version of this analysis, I used [Spell Checker Oriented Word Lists](http://wordlist.aspell.net) to make a list of 5-letter English words. \n",
	"\n",
	"Instead, this time, I am going to use the game's own list of acceptable words (extracted from the source code), found by combining these\n",
	"lists. WARNING: Looking at these will reveal answers. I recommend covering the screen with your hand to avoid spoiling the puzzle! \n",
	"* https://gist.github.com/cfreshman/40608e78e83eb4e1d60b285eb7e9732f (Acceptable words that aren't solutions.)\n",
	"* https://gist.github.com/cfreshman/a7b776506c73284511034e63af1017ee (Solutions in alphabetical order.)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"id": "a8dae571-fcc5-4a03-aa65-a88f6ec71ff4",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"12947"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"with open(\"nyt_word_list.txt\", \"rt\") as infile:\n",
	" word_list: list[str] = [k.upper() for k in infile.read().split(\"\\n\")]\n",
	"len(word_list)"
	]
	},
	{
	"cell_type": "markdown",
	"id": "9a6dbc9a-5241-4ba9-bcac-7bc4e7751d6c",
	"metadata": {
	"tags": []
	},
	"source": [
	"## Letter data\n",
	"\n",
	"Using a `Counter`, we produce a `pandas.DataFrame` with the following information about each word:\n",
	"\n",
	"1. How many of each letter is present\n",
	"2. Which letter is in each position"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"id": "92452d04-4ad6-4e28-93d9-ede2998a722d",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>A</th>\n",
	" <th>B</th>\n",
	" <th>C</th>\n",
	" <th>D</th>\n",
	" <th>E</th>\n",
	" <th>F</th>\n",
	" <th>G</th>\n",
	" <th>H</th>\n",
	" <th>I</th>\n",
	" <th>J</th>\n",
	" <th>...</th>\n",
	" <th>V</th>\n",
	" <th>W</th>\n",
	" <th>X</th>\n",
	" <th>Y</th>\n",
	" <th>Z</th>\n",
	" <th>pos0</th>\n",
	" <th>pos1</th>\n",
	" <th>pos2</th>\n",
	" <th>pos3</th>\n",
	" <th>pos4</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>AAHED</th>\n",
	" <td>2</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>...</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>A</td>\n",
	" <td>A</td>\n",
	" <td>H</td>\n",
	" <td>E</td>\n",
	" <td>D</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>AALII</th>\n",
	" <td>2</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>2</td>\n",
	" <td>0</td>\n",
	" <td>...</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>A</td>\n",
	" <td>A</td>\n",
	" <td>L</td>\n",
	" <td>I</td>\n",
	" <td>I</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>AARGH</th>\n",
	" <td>2</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>...</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>A</td>\n",
	" <td>A</td>\n",
	" <td>R</td>\n",
	" <td>G</td>\n",
	" <td>H</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>AARTI</th>\n",
	" <td>2</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>...</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>A</td>\n",
	" <td>A</td>\n",
	" <td>R</td>\n",
	" <td>T</td>\n",
	" <td>I</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>ABACA</th>\n",
	" <td>3</td>\n",
	" <td>1</td>\n",
	" <td>1</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>...</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>0</td>\n",
	" <td>A</td>\n",
	" <td>B</td>\n",
	" <td>A</td>\n",
	" <td>C</td>\n",
	" <td>A</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"<p>5 rows × 31 columns</p>\n",
	"</div>"
	],
	"text/plain": [
	" A B C D E F G H I J ... V W X Y Z pos0 pos1 pos2 \\\n",
	"AAHED 2 0 0 1 1 0 0 1 0 0 ... 0 0 0 0 0 A A H \n",
	"AALII 2 0 0 0 0 0 0 0 2 0 ... 0 0 0 0 0 A A L \n",
	"AARGH 2 0 0 0 0 0 1 1 0 0 ... 0 0 0 0 0 A A R \n",
	"AARTI 2 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 A A R \n",
	"ABACA 3 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 A B A \n",
	"\n",
	" pos3 pos4 \n",
	"AAHED E D \n",
	"AALII I I \n",
	"AARGH G H \n",
	"AARTI T I \n",
	"ABACA C A \n",
	"\n",
	"[5 rows x 31 columns]"
	]
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"LETTERS: Final[list[str]] = [chr(65 + k) for k in range(26)]\n",
	"\n",
	"def letter_counter(w: str) -> tuple[int, ...]:\n",
	" c = Counter(w)\n",
	" return tuple([c[u] for u in LETTERS])\n",
	"\n",
	"\n",
	"letter_data = pd.DataFrame.from_dict(\n",
	" {word: letter_counter(word) for word in word_list}, orient=\"index\", columns=LETTERS\n",
	")\n",
	"\n",
	"for pos in range(5):\n",
	" letter_data[f\"pos{pos}\"] = letter_data.index.map(lambda x: x[pos])\n",
	"\n",
	"letter_data.head()"
	]
	},
	{
	"cell_type": "markdown",
	"id": "19e1b8aa-8524-4ca2-af64-4f282bffff88",
	"metadata": {},
	"source": [
	"## A filtering function\n",
	"\n",
	"Given a `solution` word and a `guess` word, we want to figure out which words are still possible. The game returns the following information:\n",
	"\n",
	"(1) It shows all the letters in the `guess` that are correct (same letter in the same position) in green. \n",
	"(2) If a letter in the `guess` is also in the `solution` but in a different position, it will be shown in yellow. However, the number of times the same letter in the `guess` is shaded green or yellow will not exceed the number of times that letter appears in the `solution`. For example, in the PIZZA example above, the second Z is shaded gray and not yellow because there is only one Z in the `solution`.\n",
	"(3) The remaining letters are shaded gray.\n",
	"\n",
	"We create a function to create a query that can be run on a subset of `letter_data`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"id": "83e29070-e218-4007-b3c6-07f4dd4b9899",
	"metadata": {},
	"outputs": [],
	"source": [
	"def make_query(solution: str, guess: str) -> str:\n",
	" query_items = []\n",
	" counted_letters = []\n",
	" for idx, item in enumerate(zip(solution.upper(), guess.upper())):\n",
	" if item[0] == item[1]:\n",
	" query_items.append(f\"pos{idx} == {item[1]!r}\")\n",
	" else:\n",
	" query_items.append(f\"pos{idx} != {item[1]!r}\")\n",
	" if item[1] not in counted_letters:\n",
	" counted_letters.append(item[1])\n",
	" if (count := guess.upper().count(item[1])) > (num := solution.upper().count(item[1])):\n",
	" query_items.append(f\"{item[1]} <= {num}\")\n",
	" elif count <= num:\n",
	" query_items.append(f\"{item[1]} >= {count}\")\n",
	" return \" and \".join(query_items)"
	]
	},
	{
	"cell_type": "markdown",
	"id": "65de6b71-519e-4cc2-acb5-d772d458a7d7",
	"metadata": {},
	"source": [
	"Here's an example of the information we get from guessing PIZZA when the solution is FAZED:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"id": "3cba8e15-ae6e-4b05-8734-ffd01c0ab4ed",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"\"pos0 != 'P' and P <= 0 and pos1 != 'I' and I <= 0 and pos2 == 'Z' and Z <= 1 and pos3 != 'Z' and pos4 != 'A' and A >= 1\""
	]
	},
	"execution_count": 5,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"make_query(\"fazed\", \"pizza\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "a86bf311-1729-4028-9a7f-509fc1de188d",
	"metadata": {},
	"source": [
	"We'll make a function to produce a list of the remaining possible words:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"id": "c762bc3f-7875-4d22-ad88-ea56321c006e",
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"Index(['ADZED', 'ADZES', 'BAZAR', 'BAZOO', 'DAZED', 'DAZER', 'DAZES', 'FAZED',\n",
	" 'FAZES', 'GAZAL', 'GAZAR', 'GAZED', 'GAZER', 'GAZES', 'GAZON', 'GAZOO',\n",
	" 'HAZAN', 'HAZED', 'HAZEL', 'HAZER', 'HAZES', 'KAZOO', 'LAZAR', 'LAZED',\n",
	" 'LAZES', 'LAZOS', 'MAZED', 'MAZER', 'MAZES', 'MAZEY', 'MAZUT', 'MUZAK',\n",
	" 'NAZES', 'RAZED', 'RAZEE', 'RAZER', 'RAZES', 'RAZOO', 'RAZOR', 'SAZES',\n",
	" 'WAZOO'],\n",
	" dtype='object')"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"def remaining_possible_words(solution: str, guess: str, test_list: Optional[Sequence[str]] = None) -> list[str]:\n",
	" if test_list is None:\n",
	" test_data = letter_data\n",
	" else:\n",
	" test_data = letter_data[letter_data.index.isin([x.upper() for x in test_list])]\n",
	" return test_data.query(make_query(solution, guess)).index\n",
	"\n",
	"remaining_possible_words(\"fazed\", \"pizza\")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "fb20939f-5134-43c3-9314-4cab992adf10",
	"metadata": {
	"tags": []
	},
	"source": [
	"## Finding the best initial words\n",
	"\n",
	"One naïve way to proceed is to compute `remaining_possible_words` for each `solution, guess` pair and then find the mean for each given `guess`. However, there are over 12,000 words in the word list, so testing every `solution, guess` pair would mean running `remaining_possible_words` more than 16 million times. So, instead, we will only test against a random sample of solutions and a random sample of words to rule out."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"id": "d4c9da3b-fbb9-4f66-bc8c-173171963305",
	"metadata": {
	"tags": []
	},
	"outputs": [],
	"source": [
	"solution_sample: list[str] = sample(word_list, 200)\n",
	"test_sample: list[str] = sample(word_list, 200)\n",
	"\n",
	"def score_word(guess, solution_list: Optional[Sequence[str]]=None, test_list: Optional[Sequence[str]]=None) -> float:\n",
	" if solution_list is None:\n",
	" solution_list = word_list\n",
	" solutions = np.array([len(remaining_possible_words(solution, guess, test_list)) for solution in solution_list])\n",
	" return np.mean(solutions)\n",
	" "
	]
	},
	{
	"cell_type": "markdown",
	"id": "8c7f5fb7-d318-484b-918d-f69252a1e78b",
	"metadata": {},
	"source": [
	"For example, let's score PIZZA and AROSE."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"id": "bcece640-ef45-4d4d-8073-ed8b7c18a78c",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"40.54\n",
	"6.53\n"
	]
	}
	],
	"source": [
	"for k in [\"PIZZA\", \"AROSE\"]:\n",
	" print(score_word(k, solution_sample, test_sample))\n"
	]
	},
	{
	"cell_type": "markdown",
	"id": "3268863f-3074-4c16-823a-ecc16e3c7369",
	"metadata": {},
	"source": [
	"That means AROSE is a better initial guess than PIZZA (remember, lower is better).\n",
	"\n",
	"## Finale\n",
	"\n",
	"Now, we are ready to score all of the words and show the best-scoring words. (This cell took my late-2020 Mac M1 Mini over an hour to evaluate.)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"id": "7adf9110-2a22-42eb-a060-4b593a682107",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"LARES 4.590\n",
	"LANES 4.665\n",
	"RALES 4.700\n",
	"ALOES 4.875\n",
	"TALES 4.925\n",
	"SALET 5.005\n",
	"ARLES 5.005\n",
	"LORES 5.025\n",
	"SALUE 5.025\n",
	"LAERS 5.045\n",
	"LEANS 5.085\n",
	"AEROS 5.115\n",
	"AEONS 5.150\n",
	"LEARS 5.170\n",
	"SOARE 5.185\n",
	"ROLES 5.195\n",
	"NARES 5.220\n",
	"EARLS 5.280\n",
	"TARES 5.280\n",
	"SOLER 5.325\n",
	"dtype: float64"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"word_scores = pd.Series(\n",
	" {word: score_word(word, solution_sample, test_sample) for word in word_list})\n",
	"\n",
	"word_scores.sort_values().head(20)"
	]
	},
	{
	"cell_type": "markdown",
	"id": "8dc94069-07ca-40fb-a881-fe471f9ef6b8",
	"metadata": {},
	"source": [
	"Note that most of these words end in S. That makes sense, since many words (plural nouns and third-person present-tense verbs) end with S. Guessing with a word ending in S would immediately either rule these words out or confirm that the solution ends with S.\n",
	"\n",
	"So next time you play Wordle, perhaps start with LARES, LANES, RALES, ALOES, or another word from this list."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"id": "cc98e572-55f7-4a61-935c-5190368c364f",
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "data1",
	"language": "python",
	"name": "data1"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.10.2"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 5
	}