yunruse/pangram.ipynb

## pangram.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Toki Pona Pangram Maker\n",
    "The language Toki Pona has only around 137 words, due to its intended minimalism.\n",
    "\n",
    "For example, \"car\" could be said as \"ilo tawa\", literally \"tool go\", or \"tool of going\". This is inherently ambiguous – that could also mean bicycle – but this is intentional: you talk about what you can see in front of you, and only specify what is necessary.\n",
    "\n",
    "In keeping with its theme, the language has an extremely simple syllable structure.\n",
    "A syllable consists of an consonant (`ptksmnljw`) followed by a vowel (`aiueo`) followed by an optional `n`. A word can start without a consonant. The sequences `ji`, `wu`, `wo`, `ti`, `nn` and `nm` cannot occur, as they are ambiguous to some listeners."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A pangram is a sentence containing all letters. In English, this is most well known as \"The quick brown fox jumps over the lazy dog.\" In Toki Pona, many exist, such as:\n",
    "\n",
    "> musi jo li tenpo weka.\n",
    ">\n",
    "> _Humour held is time discarded._ (So don't waste too much time!)\n",
    ">\n",
    "> – [Nathan McCoy](https://www.reddit.com/r/tokipona/comments/h9bqk3/some_minimal_pangrams/)\n",
    "\n",
    "Bit of a grim saying, though, right: what if we generated our own?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have a list of words [https://lipu-linku.github.io], and we'll use both _pu_ ('official') and _ku lili_ ('unofficial' but highly recognised) words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = '''\n",
    "a akesi ala alasa ale ali anpa ante anu awen e esun ijo ike ilo insa jaki jan jelo jo kala kalama kama kasi ken kepeken kili kiwen ko kon kule kulupu kute la lape laso lawa len lete li lili linja lipu lon luka lukin lupa ma mama mani meli mi mije moku moli monsi mu mun musi mute nanpa nasa nasin nena ninimi noka o olin ona open pakala pali palisa pan pana pi pilin pimeja pini pipi poka poki pona sama seli selo seme sewi sijelo sike sin sina sinpin sitelen sona soweli suli suno supa suwi tan taso tawa telo tenpo toki tomo tu unpa uta utala walo wan waso wawa weka wile\n",
    "\n",
    "epiku jasima kijetesantakalu kin kipisi lanpan leko meso misikeke monsuta namako oko soko tonsi\n",
    "'''.split()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The smallest pangram possible in Toki Pona is 17 characters – 8 `CV` syllables, one of them ending in `n`. Reasonably, we can enumerate this. Because we have every list of words, and each syllable is very easy to systematically evaluate, we can generate _every_ possible pangram.\n",
    "\n",
    "Naïvely, we could split this into two parts:\n",
    "1. Generate every possible string of syllables.\n",
    "2. Try to parse this into a series of words.\n",
    "\n",
    "We can't practically store the results of step 1 – there are _trillions_ of 8-syllable sequences possible.\n",
    "\n",
    "Instead, let's merge the steps together. We'll do a depth-first search, syllable by syllable. Crucially, we'll make sure we're not looking for syllables where there aren't even any words to start with. For example, we would skip at `esun pe` (no word starts with `pe`) or `esun jela` (no word starts with `jela`).\n",
    "\n",
    "Let's start by removing all the words that don't fit into our 17-char pangram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['jaki', 'jan', 'jelo', 'jo', 'kala', 'kalama', 'kama', 'kasi', 'ken', 'kili', 'kiwen', 'ko', 'kon', 'kule', 'kulupu', 'kute', 'la', 'lape', 'laso', 'lawa', 'len', 'lete', 'li', 'linja', 'lipu', 'lon', 'luka', 'lukin', 'lupa', 'ma', 'meli', 'mi', 'mije', 'moku', 'moli', 'monsi', 'mu', 'mun', 'musi', 'mute', 'pakala', 'pali', 'palisa', 'pan', 'pi', 'pilin', 'pimeja', 'poka', 'poki', 'sama', 'seli', 'selo', 'seme', 'sewi', 'sijelo', 'sike', 'sin', 'sitelen', 'soweli', 'suli', 'supa', 'suwi', 'tan', 'taso', 'tawa', 'telo', 'tenpo', 'toki', 'tomo', 'tu', 'walo', 'wan', 'waso', 'weka', 'wile', 'jasima', 'kin', 'kipisi', 'leko', 'meso', 'monsuta', 'soko', 'tonsi']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "83"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "CONSONANTS = 'ptksmnljw'\n",
    "VOWELS = 'aeiou'\n",
    "wp = words_pangram = [w for w in words if all((\n",
    "    all(not w.startswith(v) for v in VOWELS),      # can't start with a vowel\n",
    "    all(not 'n'+v in w      for v in VOWELS),      # `n` should only end a syllable\n",
    "    all(w.count(c) <= 1     for c in CONSONANTS),  # no repeated consonants\n",
    "))]\n",
    "print(wp)\n",
    "len(wp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In theory, we can now simply run, depth-first, over each word. For each word we look at, we'll make sure we're not re-using a consonant. So, let's pre-process the consonants, shall we?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "wp2 = [(w, set(w) - set('aiueo')) for w in wp]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay! Now to evaluate. We'll use a recursive generator to do this. This will make use of the above set structure to ensure words can't be repeated.\n",
    "\n",
    "The `filter` function can be used to filter out fragments. For example, no valid Toki Pona sentence starts with `li` – so by filtering early, we make sure we don't generate the trillions of sequences that would have passed by. It takes a list of words, and also a boolean indicating if the sequence is finished."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Callable, List\n",
    "\n",
    "FilterFunc = Callable[[List[str], bool], bool]\n",
    "\n",
    "def pangrams(filter: FilterFunc = lambda x: True, start=None):\n",
    "    return _pangrams(filter, start or [], set())\n",
    "\n",
    "FINAL_SEEN = set(CONSONANTS + 'naiueo')\n",
    "\n",
    "def _pangrams(filter: FilterFunc, words: List[str], seen: set):\n",
    "    done = False\n",
    "    all_letters = ''.join(words)\n",
    "    l = len(all_letters)\n",
    "    if l > 17:\n",
    "        return\n",
    "    done = l == 17\n",
    "    if l and not filter(words, done):\n",
    "        return\n",
    "    if done:\n",
    "        # It's gotta be a pangram!\n",
    "        if set(all_letters) == FINAL_SEEN:\n",
    "            yield words\n",
    "    else:\n",
    "        for w, consonants in wp2:\n",
    "            if not seen & consonants:\n",
    "                yield from _pangrams(filter, words + [w], seen | consonants)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a good generator, and quite efficient. However, there are a lot of pangrams, and most of them are nonsense. Surprisingly, however, so long as we place a condition or two – for example, the presence of `li`, and starting with `tenpo`, as below – we can generate a list of pangrams in no time. Give it a spin, and add your own filters!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "PARTICLES = ('li', 'pi', 'la')\n",
    "\n",
    "def particles_no_edges(words, finished):\n",
    "    for p in PARTICLES:\n",
    "        if words[0] == p:\n",
    "            return False\n",
    "        elif finished and words[-1] == p:\n",
    "            return False\n",
    "        return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def beautify(words, done):\n",
    "    if not particles_no_edges(words, done):\n",
    "        return False\n",
    "    \n",
    "    # if done:\n",
    "    #     if len(words) < 5:\n",
    "    #         return False\n",
    "    # else:\n",
    "    #     if sum(len(w) == 2 for w in words) > 1:\n",
    "    #         return False\n",
    "\n",
    "    def d(*targets):\n",
    "        '''return True if the word(s) can't be used in the sequence'''\n",
    "        for t in targets:\n",
    "            cons = set(t) - set('aiueo')\n",
    "            for w in words:\n",
    "                if w != t:\n",
    "                    if set(w) & cons:\n",
    "                        return True\n",
    "    \n",
    "    # if d('jan'):\n",
    "    #     return False\n",
    "    \n",
    "    return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "ps = list(pangrams(beautify))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "936264\n"
     ]
    }
   ],
   "source": [
    "print(len(ps))\n",
    "# for i in ps:\n",
    "#     print(' '.join(i))"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
  },
  "kernelspec": {
   "display_name": "Python 3.8.9 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.9"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Toki Pona Pangram Maker\n",
	"The language Toki Pona has only around 137 words, due to its intended minimalism.\n",
	"\n",
	"For example, \"car\" could be said as \"ilo tawa\", literally \"tool go\", or \"tool of going\". This is inherently ambiguous – that could also mean bicycle – but this is intentional: you talk about what you can see in front of you, and only specify what is necessary.\n",
	"\n",
	"In keeping with its theme, the language has an extremely simple syllable structure.\n",
	"A syllable consists of an consonant (`ptksmnljw`) followed by a vowel (`aiueo`) followed by an optional `n`. A word can start without a consonant. The sequences `ji`, `wu`, `wo`, `ti`, `nn` and `nm` cannot occur, as they are ambiguous to some listeners."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"A pangram is a sentence containing all letters. In English, this is most well known as \"The quick brown fox jumps over the lazy dog.\" In Toki Pona, many exist, such as:\n",
	"\n",
	"> musi jo li tenpo weka.\n",
	">\n",
	"> _Humour held is time discarded._ (So don't waste too much time!)\n",
	">\n",
	"> – [Nathan McCoy](https://www.reddit.com/r/tokipona/comments/h9bqk3/some_minimal_pangrams/)\n",
	"\n",
	"Bit of a grim saying, though, right: what if we generated our own?"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We have a list of words [https://lipu-linku.github.io], and we'll use both _pu_ ('official') and _ku lili_ ('unofficial' but highly recognised) words:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"words = '''\n",
	"a akesi ala alasa ale ali anpa ante anu awen e esun ijo ike ilo insa jaki jan jelo jo kala kalama kama kasi ken kepeken kili kiwen ko kon kule kulupu kute la lape laso lawa len lete li lili linja lipu lon luka lukin lupa ma mama mani meli mi mije moku moli monsi mu mun musi mute nanpa nasa nasin nena ninimi noka o olin ona open pakala pali palisa pan pana pi pilin pimeja pini pipi poka poki pona sama seli selo seme sewi sijelo sike sin sina sinpin sitelen sona soweli suli suno supa suwi tan taso tawa telo tenpo toki tomo tu unpa uta utala walo wan waso wawa weka wile\n",
	"\n",
	"epiku jasima kijetesantakalu kin kipisi lanpan leko meso misikeke monsuta namako oko soko tonsi\n",
	"'''.split()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The smallest pangram possible in Toki Pona is 17 characters – 8 `CV` syllables, one of them ending in `n`. Reasonably, we can enumerate this. Because we have every list of words, and each syllable is very easy to systematically evaluate, we can generate _every_ possible pangram.\n",
	"\n",
	"Naïvely, we could split this into two parts:\n",
	"1. Generate every possible string of syllables.\n",
	"2. Try to parse this into a series of words.\n",
	"\n",
	"We can't practically store the results of step 1 – there are _trillions_ of 8-syllable sequences possible.\n",
	"\n",
	"Instead, let's merge the steps together. We'll do a depth-first search, syllable by syllable. Crucially, we'll make sure we're not looking for syllables where there aren't even any words to start with. For example, we would skip at `esun pe` (no word starts with `pe`) or `esun jela` (no word starts with `jela`).\n",
	"\n",
	"Let's start by removing all the words that don't fit into our 17-char pangram."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"['jaki', 'jan', 'jelo', 'jo', 'kala', 'kalama', 'kama', 'kasi', 'ken', 'kili', 'kiwen', 'ko', 'kon', 'kule', 'kulupu', 'kute', 'la', 'lape', 'laso', 'lawa', 'len', 'lete', 'li', 'linja', 'lipu', 'lon', 'luka', 'lukin', 'lupa', 'ma', 'meli', 'mi', 'mije', 'moku', 'moli', 'monsi', 'mu', 'mun', 'musi', 'mute', 'pakala', 'pali', 'palisa', 'pan', 'pi', 'pilin', 'pimeja', 'poka', 'poki', 'sama', 'seli', 'selo', 'seme', 'sewi', 'sijelo', 'sike', 'sin', 'sitelen', 'soweli', 'suli', 'supa', 'suwi', 'tan', 'taso', 'tawa', 'telo', 'tenpo', 'toki', 'tomo', 'tu', 'walo', 'wan', 'waso', 'weka', 'wile', 'jasima', 'kin', 'kipisi', 'leko', 'meso', 'monsuta', 'soko', 'tonsi']\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"83"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"CONSONANTS = 'ptksmnljw'\n",
	"VOWELS = 'aeiou'\n",
	"wp = words_pangram = [w for w in words if all((\n",
	" all(not w.startswith(v) for v in VOWELS), # can't start with a vowel\n",
	" all(not 'n'+v in w for v in VOWELS), # `n` should only end a syllable\n",
	" all(w.count(c) <= 1 for c in CONSONANTS), # no repeated consonants\n",
	"))]\n",
	"print(wp)\n",
	"len(wp)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"In theory, we can now simply run, depth-first, over each word. For each word we look at, we'll make sure we're not re-using a consonant. So, let's pre-process the consonants, shall we?"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"wp2 = [(w, set(w) - set('aiueo')) for w in wp]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Okay! Now to evaluate. We'll use a recursive generator to do this. This will make use of the above set structure to ensure words can't be repeated.\n",
	"\n",
	"The `filter` function can be used to filter out fragments. For example, no valid Toki Pona sentence starts with `li` – so by filtering early, we make sure we don't generate the trillions of sequences that would have passed by. It takes a list of words, and also a boolean indicating if the sequence is finished."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [],
	"source": [
	"from typing import Callable, List\n",
	"\n",
	"FilterFunc = Callable[[List[str], bool], bool]\n",
	"\n",
	"def pangrams(filter: FilterFunc = lambda x: True, start=None):\n",
	" return _pangrams(filter, start or [], set())\n",
	"\n",
	"FINAL_SEEN = set(CONSONANTS + 'naiueo')\n",
	"\n",
	"def _pangrams(filter: FilterFunc, words: List[str], seen: set):\n",
	" done = False\n",
	" all_letters = ''.join(words)\n",
	" l = len(all_letters)\n",
	" if l > 17:\n",
	" return\n",
	" done = l == 17\n",
	" if l and not filter(words, done):\n",
	" return\n",
	" if done:\n",
	" # It's gotta be a pangram!\n",
	" if set(all_letters) == FINAL_SEEN:\n",
	" yield words\n",
	" else:\n",
	" for w, consonants in wp2:\n",
	" if not seen & consonants:\n",
	" yield from _pangrams(filter, words + [w], seen \| consonants)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This is a good generator, and quite efficient. However, there are a lot of pangrams, and most of them are nonsense. Surprisingly, however, so long as we place a condition or two – for example, the presence of `li`, and starting with `tenpo`, as below – we can generate a list of pangrams in no time. Give it a spin, and add your own filters!"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [],
	"source": [
	"PARTICLES = ('li', 'pi', 'la')\n",
	"\n",
	"def particles_no_edges(words, finished):\n",
	" for p in PARTICLES:\n",
	" if words[0] == p:\n",
	" return False\n",
	" elif finished and words[-1] == p:\n",
	" return False\n",
	" return True"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [],
	"source": [
	"def beautify(words, done):\n",
	" if not particles_no_edges(words, done):\n",
	" return False\n",
	" \n",
	" # if done:\n",
	" # if len(words) < 5:\n",
	" # return False\n",
	" # else:\n",
	" # if sum(len(w) == 2 for w in words) > 1:\n",
	" # return False\n",
	"\n",
	" def d(*targets):\n",
	" '''return True if the word(s) can't be used in the sequence'''\n",
	" for t in targets:\n",
	" cons = set(t) - set('aiueo')\n",
	" for w in words:\n",
	" if w != t:\n",
	" if set(w) & cons:\n",
	" return True\n",
	" \n",
	" # if d('jan'):\n",
	" # return False\n",
	" \n",
	" return True"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [],
	"source": [
	"ps = list(pangrams(beautify))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"936264\n"
	]
	}
	],
	"source": [
	"print(len(ps))\n",
	"# for i in ps:\n",
	"# print(' '.join(i))"
	]
	}
	],
	"metadata": {
	"interpreter": {
	"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
	},
	"kernelspec": {
	"display_name": "Python 3.8.9 64-bit",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.8.9"
	},
	"orig_nbformat": 4
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}