@christopherphan
Last active January 20, 2022 04:51
The best initial word to play in Wordle, based on letter frequencies
{
"cells": [
{
"cell_type": "markdown",
"id": "40835e56-69ef-4cad-ad86-a0073d969ec1",
"metadata": {},
"source": [
"# Best starting Wordle word: Most common letters approach\n",
"\n",
"Christopher Phan <chrisphan.com>\n",
"\n",
"2022-01-19\n",
"\n",
"## Intro\n",
"\n",
"[Wordle](https://www.powerlanguage.co.uk/wordle/) is a word puzzle game and massive social media sensation. Exactly one puzzle is given every day, revealed at midnight local time. The goal is to determine the solution word, which is always a five-letter word, in six or fewer guesses. After each guess, the game indicates which letters in the guessed word (1) are not in the solution word, (2) are in the solution word but in the wrong positions, or (3) are in the solution word in the correct positions.\n",
"\n",
"For example, if your guess was \"PIZZA\" and the solution word was \"FAZED\", the game would respond with the following output:\n",
"\n",
"![](pizza.png)\n",
"\n",
"This indicates that the letters P, I, and the second Z are not in the solution (there's only one Z), the first Z is in the correct position, and the A is in the solution but not in that position.\n",
"\n",
"In the game, the first guess is made without any information, so it might as well be the same word every time. What is the best word to use?\n",
"\n",
"In this notebook, I will consider the strategy of using the first guess to gather information about the most common letters. Think of the bonus round of Wheel of Fortune, where the contestant is given some common letters for free. My goal is to find the five-letter word with (1) the most common letters and (2) every letter distinct.\n",
"\n",
"## Imports"
]
},
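{
"cell_type": "markdown",
"id": "f3a9d2c1-0000-4000-8000-000000000001",
"metadata": {},
"source": [
"As a rough sketch (not necessarily the game's exact algorithm; the `feedback` function below is a name invented here for illustration), this feedback can be computed by first marking exact-position matches and then matching each remaining guessed letter against the still-unmatched letters of the solution:\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def feedback(guess: str, solution: str) -> list[str]:\n",
"    # Letters of the solution not already matched in place.\n",
"    remaining = Counter(s for g, s in zip(guess, solution) if g != s)\n",
"    result = []\n",
"    for g, s in zip(guess, solution):\n",
"        if g == s:\n",
"            result.append(\"green\")  # right letter, right position\n",
"        elif remaining[g] > 0:\n",
"            result.append(\"yellow\")  # right letter, wrong position\n",
"            remaining[g] -= 1\n",
"        else:\n",
"            result.append(\"gray\")  # letter not (further) present\n",
"    return result\n",
"\n",
"feedback(\"PIZZA\", \"FAZED\")  # ['gray', 'gray', 'green', 'gray', 'yellow']\n",
"```"
]
},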
{
"cell_type": "code",
"execution_count": 1,
"id": "c1ecfadc-5daa-4a39-b6de-f46c1a393c65",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from dataclasses import dataclass\n",
"from typing import Final\n",
"from math import fsum\n",
"from random import sample\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "870d9097-3153-4470-a706-6b4eb1370369",
"metadata": {},
"source": [
"## Building the word list\n",
"\n",
"We are going to use the [Spell Checker Oriented Word Lists](http://wordlist.aspell.net) to make a list of 5-letter English words. \n",
"\n",
"From <http://wordlist.aspell.net> download SCOWL in a compressed archive, and then extract the archive.\n",
" \n",
"I would show the shell commands used to download it (e.g. with `curl`), but the file is hosted on SourceForge behind a series of redirects.\n",
"\n",
"SCOWL's [README](http://wordlist.aspell.net/scowl-readme/) explains:\n",
"\n",
" Except for the special word lists the files follow the following\n",
" naming convention:\n",
" <spelling category>-<sub-category>.<size>\n",
" Where the spelling category is one of\n",
" english, american, british, british_z, canadian, australian\n",
" variant_1, variant_2, variant_3,\n",
" british_variant_1, british_variant_2, \n",
" canadian_variant_1, canadian_variant_2,\n",
" australian_variant_1, australian_variant_2\n",
" Sub-category is one of\n",
" abbreviations, contractions, proper-names, upper, words\n",
" And size is one of\n",
" 10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large), \n",
" 80 (huge), 95 (insane)\n",
" \n",
"We'll use the `words` subcategory of every `spelling category`, with a size of `70` or lower."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a8dae571-fcc5-4a03-aa65-a88f6ec71ff4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7662"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CATEGORIES: Final[list[str]] = [\n",
" \"english\",\n",
" \"american\",\n",
" \"british\",\n",
" \"british_z\",\n",
" \"canadian\",\n",
" \"australian\",\n",
" \"variant_1\",\n",
" \"variant_2\",\n",
" \"variant_3\",\n",
" \"british_variant_1\",\n",
" \"british_variant_2\",\n",
" \"canadian_variant_1\",\n",
" \"canadian_variant_2\",\n",
" \"australian_variant_1\",\n",
" \"australian_variant_2\",\n",
"]\n",
"MAX_SIZE: Final[int] = 70\n",
"SIZES: Final[list[int]] = [k for k in [10, 20, 35, 40, 50, 55, 60, 70, 80, 95] if k <= MAX_SIZE]\n",
"LETTERS: Final[list[str]] = [chr(65 + k) for k in range(26)]\n",
"filenames: list[str] = [\n",
" f\"{category}-words.{size:d}\" for category in CATEGORIES for size in SIZES\n",
"]\n",
"\n",
"word_list: list[str] = []\n",
"for file in filenames:\n",
" with open(\"scowl-2020.12.07/final/\" + file, \"rt\", errors=\"replace\") as infile:\n",
" word_list.extend(\n",
" [\n",
" k.upper().strip()\n",
" for k in infile.read().split(\"\\n\")\n",
" if (len(k) == 5 and all(u in LETTERS for u in k.upper()))\n",
" ]\n",
" )\n",
" \n",
"len(word_list)"
]
},
{
"cell_type": "markdown",
"id": "5004916c-5d58-49b2-9d31-be2c04497160",
"metadata": {},
"source": [
"Now we have a good list of the relatively common, non-proper-noun, five-letter words in the English language. I'm going to save these words in case I want to use them later."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "38e7c2cc-7f4c-4b21-9ecf-bf2d9ed104a4",
"metadata": {},
"outputs": [],
"source": [
"with open(\"word_list.txt\", \"wt\") as outfile:\n",
" outfile.write(\"\\n\".join(word_list))"
]
},
{
"cell_type": "markdown",
"id": "9a6dbc9a-5241-4ba9-bcac-7bc4e7751d6c",
"metadata": {
"tags": []
},
"source": [
"## Letter frequency table\n",
"\n",
"We are going to figure out the relative frequency of letters in our list of five-letter words. We use a Python `Counter` to count the letters in each word and put the result into a `pandas.DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "92452d04-4ad6-4e28-93d9-ede2998a722d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>A</th>\n",
" <th>B</th>\n",
" <th>C</th>\n",
" <th>D</th>\n",
" <th>E</th>\n",
" <th>F</th>\n",
" <th>G</th>\n",
" <th>H</th>\n",
" <th>I</th>\n",
" <th>J</th>\n",
" <th>...</th>\n",
" <th>Q</th>\n",
" <th>R</th>\n",
" <th>S</th>\n",
" <th>T</th>\n",
" <th>U</th>\n",
" <th>V</th>\n",
" <th>W</th>\n",
" <th>X</th>\n",
" <th>Y</th>\n",
" <th>Z</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>ABOUT</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ABOVE</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ABUSE</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ACTED</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ADDED</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" A B C D E F G H I J ... Q R S T U V W X Y Z\n",
"ABOUT 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 1 1 0 0 0 0 0\n",
"ABOVE 1 1 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0\n",
"ABUSE 1 1 0 0 1 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0\n",
"ACTED 1 0 1 1 1 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0\n",
"ADDED 1 0 0 3 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0\n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def letter_counter(w: str) -> tuple[int, ...]:\n",
" c = Counter(w)\n",
" return tuple([c[u] for u in LETTERS])\n",
"\n",
"\n",
"letter_freq_by_word = pd.DataFrame.from_dict(\n",
" {word: letter_counter(word) for word in word_list}, orient=\"index\", columns=LETTERS\n",
")\n",
"\n",
"letter_freq_by_word.head()"
]
},
{
"cell_type": "markdown",
"id": "77a1d02e-d7ee-4b3e-995b-d4a0a27e9c22",
"metadata": {},
"source": [
"For example, the word \"ADDED\" has 3 \"D\"s, so its \"D\" column entry is 3. To get the total count for each letter, `sum` down the rows."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "af15f81c-792a-475e-9c40-1b34c44cb703",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"A 2917\n",
"B 841\n",
"C 1179\n",
"D 1293\n",
"E 3505\n",
"F 624\n",
"G 842\n",
"H 930\n",
"I 1927\n",
"J 121\n",
"K 710\n",
"L 1885\n",
"M 982\n",
"N 1559\n",
"O 2221\n",
"P 1090\n",
"Q 61\n",
"R 2235\n",
"S 3395\n",
"T 1807\n",
"U 1268\n",
"V 384\n",
"W 531\n",
"X 146\n",
"Y 1007\n",
"Z 180\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_freq_raw = letter_freq_by_word.sum(\"rows\")\n",
"letter_freq_raw"
]
},
{
"cell_type": "markdown",
"id": "d6f65605-d1e4-4ff8-9e1c-f62afd5722b7",
"metadata": {},
"source": [
"This tells you that in the 7662 words on our list, there are 2917 A's, 841 B's, etc. However, we would like the *relative* frequencies."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "ce370c63-4ab3-44f1-9fa1-1772f7183ef8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"E 10.419%\n",
"S 10.092%\n",
"A 8.671%\n",
"R 6.644%\n",
"O 6.602%\n",
"I 5.728%\n",
"L 5.603%\n",
"T 5.372%\n",
"N 4.634%\n",
"D 3.844%\n",
"U 3.769%\n",
"C 3.505%\n",
"P 3.240%\n",
"Y 2.993%\n",
"M 2.919%\n",
"H 2.765%\n",
"G 2.503%\n",
"B 2.500%\n",
"K 2.111%\n",
"F 1.855%\n",
"W 1.578%\n",
"V 1.141%\n",
"Z 0.535%\n",
"X 0.434%\n",
"J 0.360%\n",
"Q 0.181%\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"letter_freq = letter_freq_raw / sum(letter_freq_raw)\n",
"letter_freq.sort_values(ascending=False).map(lambda x: f\"{x:.3%}\")"
]
},
{
"cell_type": "markdown",
"id": "b7e4d0db-8f95-46f6-8485-7d4c10f2b024",
"metadata": {},
"source": [
"Now we know which letters are the most common in *five-letter* words. (A letter frequency table computed over words of all lengths might be different.)\n",
"\n",
"Note that at this point, if you take a close look at the five most frequent letters, you might be able to figure out what the best word is, according to our criteria.\n",
" \n",
"## Finding the best initial words\n",
"\n",
"First, we define a way to score a word, by adding up the relative frequencies for each letter."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "046a4369-cae3-4db6-92ca-434f9215e031",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def score_word(w: str) -> float:\n",
" return fsum([letter_freq[letter.upper()] for letter in w])"
]
},
{
"cell_type": "markdown",
"id": "ec5c4cb9-de52-4f69-b7e0-50e79b76b3be",
"metadata": {},
"source": [
"For example, the word ZESTY has a lower score than RACES:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "753267da-f456-46e8-8126-ba77622f38e2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.2941141498216409 0.39331153388822826\n"
]
}
],
"source": [
"print(score_word(\"zesty\"), score_word(\"races\"))"
]
},
{
"cell_type": "markdown",
"id": "3c014564-da5c-41db-b6d7-0cec616701c9",
"metadata": {},
"source": [
"We are looking for the highest-scoring word with *no repeated letters*. Start by making a list of words where all five letters are distinct."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6b915fb5-54a1-4c19-ae0a-954ce967a35a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4435"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words_unique_letters = letter_freq_by_word[\n",
" letter_freq_by_word.index.map(lambda x: len(set(x)) == 5)\n",
"].copy()\n",
"len(words_unique_letters)"
]
},
{
"cell_type": "markdown",
"id": "b8d7b6c2-1995-4217-ba59-c2ac543f20ef",
"metadata": {},
"source": [
"Note that a word and any of its anagrams have the same score:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4feef5ba-9479-443b-9705-3b7f8d738321",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.3384066587395957 0.3384066587395957\n"
]
}
],
"source": [
"print(score_word(\"large\"), score_word(\"regal\"))"
]
},
{
"cell_type": "markdown",
"id": "49c4b145-cd17-4092-9b3a-61bfaaf517d4",
"metadata": {},
"source": [
"Hence, we can treat a word and each of its anagrams as interchangeable. Let's divide the words up by anagram set."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "009efe94-2ba8-4a13-9fb3-9c734314b23f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Some example anagram groups\n",
"\n",
"BUYER\n",
"YOKED\n",
"GRANT\n",
"WRAPT\n",
"PAINS, NIPAS\n",
"SUMAC\n",
"WIRES, WISER, WEIRS\n",
"MOUSE, MOUES\n",
"PROMS, ROMPS\n",
"MEINY\n",
"WIZES\n",
"BLUSH, BUHLS\n",
"LEMAN\n",
"GLARY, GYRAL\n",
"VENTS\n",
"PINEY\n",
"CHIVE\n",
"DWARF\n",
"BARDS, DRABS, BRADS\n",
"GIRLS\n"
]
}
],
"source": [
"def sort_letters(x: str) -> str:\n",
" return \"\".join(sorted(x))\n",
"\n",
"words_unique_letters[\"sorted\"] = words_unique_letters.index.map(sort_letters)\n",
"unique_letter_sets = set(words_unique_letters[\"sorted\"])\n",
"\n",
"def make_group(x: str) -> list[str]:\n",
" return list(words_unique_letters[words_unique_letters[\"sorted\"] == sort_letters(x)].index)\n",
"\n",
"print(\"Some example anagram groups\\n\")\n",
"anagram_groups = {key : make_group(key) for key in unique_letter_sets}\n",
"for word in sample(list(anagram_groups.keys()), 20):\n",
" print(', '.join(anagram_groups[word]))"
]
},
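{
"cell_type": "markdown",
"id": "f3a9d2c1-0000-4000-8000-000000000002",
"metadata": {},
"source": [
"A side note on the cell above: `make_group` filters the whole DataFrame once per anagram set, which is quadratic in the number of groups. An equivalent single-pass sketch uses `pandas` `groupby` on the `sorted` column:\n",
"\n",
"```python\n",
"anagram_groups = {\n",
"    key: list(idx)\n",
"    for key, idx in words_unique_letters.groupby(\"sorted\").groups.items()\n",
"}\n",
"```\n",
"\n",
"Since the DataFrame's index holds the words themselves, each group's index labels are exactly the words of that anagram set."
]
},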
{
"cell_type": "markdown",
"id": "6c0f7ee2-4c57-4f8b-b36c-e9f3b5f034e2",
"metadata": {
"tags": []
},
"source": [
"## Finale\n",
"\n",
"Now, we are ready to score every anagram group and list the five highest-scoring ones."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "075981e5-8502-4c3c-9e72-5455d7025d48",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>anagram set</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AROSE</th>\n",
" <td>AROSE</td>\n",
" <td>0.424287</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ARISE</th>\n",
" <td>ARISE, RAISE, SERAI</td>\n",
" <td>0.415547</td>\n",
" </tr>\n",
" <tr>\n",
" <th>LASER</th>\n",
" <td>LASER, EARLS, REALS, LARES, RALES</td>\n",
" <td>0.414298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ALOES</th>\n",
" <td>ALOES</td>\n",
" <td>0.413882</td>\n",
" </tr>\n",
" <tr>\n",
" <th>RATES</th>\n",
" <td>RATES, STARE, TEARS, ASTER, TARES, TASER, RESAT</td>\n",
" <td>0.411980</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" anagram set score\n",
"AROSE AROSE 0.424287\n",
"ARISE ARISE, RAISE, SERAI 0.415547\n",
"LASER LASER, EARLS, REALS, LARES, RALES 0.414298\n",
"ALOES ALOES 0.413882\n",
"RATES RATES, STARE, TEARS, ASTER, TARES, TASER, RESAT 0.411980"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_scores = pd.DataFrame.from_dict(\n",
" {group[0]: [\", \".join(group), score_word(group[0])] for group in anagram_groups.values()},\n",
" orient=\"index\",\n",
" columns=[\"anagram set\", \"score\"],\n",
")\n",
"word_scores.sort_values(\"score\", ascending=False).head()"
]
},
{
"cell_type": "markdown",
"id": "ee70c0c3-51d3-45f2-aba3-6890f9ca1417",
"metadata": {},
"source": [
"The next time you play Wordle, perhaps start with AROSE?\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "data1",
"language": "python",
"name": "data1"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}