@analyticascent
Created May 11, 2020 22:11

Generating a list of words that only use certain frequently used letters. Reduces learning curve for teaching a child how to read.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Reading Curve Demo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a proof of concept for a _single feature_ in a much bigger edtech project I have in the works. It shows how to generate a list of common words that use only some of the most frequently used letters in the alphabet. The result lets a first-time English reader learn to read with less effort over time: rather than learning to sound out every letter before reading anything, a learner can begin with the most common letters and words and work up from there.\n",
"\n",
"Only two things are needed as starting input:\n",
"\n",
"* A list of commonly used words in the English language\n",
"* A frequency distribution of how often letters are used\n",
"\n",
"For the word list, I used a collection generated from Google's n-gram corpus (https://github.com/first20hours/google-10000-english), and for letter frequency I made use of the following distribution chart from Wikipedia:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![letter frequency](https://upload.wikimedia.org/wikipedia/commons/b/b0/English_letter_frequency_%28frequency%29.svg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both of these could be generated by hand in Python using an existing corpus of text, some n-gram counts, and character frequency analysis, but I'm sticking with the two above resources for simplicity.\n",
"\n",
"**Now for the code itself:**"
]
},
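{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick aside, here is a minimal sketch of how a letter-frequency distribution could be generated by hand with Python's standard library. The corpus string below is a stand-in, not the actual data used in this notebook:"
]
},

```python
from collections import Counter

# stand-in corpus; a real run would use a large body of English text
text = "the quick brown fox jumps over the lazy dog"

# count alphabetic characters only, case-folded
freq = Counter(c for c in text.lower() if c.isalpha())

print(freq.most_common(5))
```
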
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# this will be used to read in the word list\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# reading in the word list from the url it's hosted on and storing it in 'df'\n",
"\n",
"url = 'https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa-no-swears.txt'\n",
"df = pd.read_csv(url, sep=\" \", header=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&nbsp;\n",
"\n",
"We should now have a list of common English words loaded into a data frame and stored within the `df` variable.\n",
"\n",
"**Let's check if it loaded properly:**"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>the</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>of</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>and</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>to</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>a</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"0 the\n",
"1 of\n",
"2 and\n",
"3 to\n",
"4 a"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# checking the first five words of the list\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9879</th>\n",
" <td>varieties</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9880</th>\n",
" <td>arbor</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9881</th>\n",
" <td>mediawiki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9882</th>\n",
" <td>configurations</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9883</th>\n",
" <td>poison</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"9879 varieties\n",
"9880 arbor\n",
"9881 mediawiki\n",
"9882 configurations\n",
"9883 poison"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# checking the last five words of the list\n",
"\n",
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&nbsp;\n",
"\n",
"So far so good. Now we need to decide which letters we want to keep or exclude from the final list of words.\n",
"\n",
"Our focus will mostly be on letters that are used the most frequently, but also which ones would be necessary to have a wide array of words to choose from when making simple sentences later.\n",
"\n",
"**Here's what I chose:**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# a list of the common letters we will use, and the ones to be excluded\n",
"\n",
"common_letters = ['e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'c']\n",
"\n",
"letters_exclude = ['b', 'd', 'f', 'g', 'j', 'k', 'l', 'm', 'p', 'q', 'u', 'v', 'w', 'x', 'y', 'z']"
]
},
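{
"cell_type": "markdown",
"metadata": {},
"source": [
"Maintaining two lists by hand risks them drifting out of sync. As a sketch (not part of the original cells), the excluded letters could instead be derived from the chosen ones:"
]
},

```python
import string

common_letters = ['e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'c']

# everything in a-z that isn't one of the chosen letters
letters_exclude = sorted(set(string.ascii_lowercase) - set(common_letters))

print(letters_exclude)
```
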
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&nbsp;\n",
"\n",
"Next, we need a function that can check a word on the list and see if it contains _any_ of the letters we want to exclude.\n",
"\n",
"The function below works as follows:\n",
"\n",
"* If any letter in a given word is inside the `letters_exclude` list...\n",
"* Return `False`\n",
"* Otherwise, return `True`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# a function that checks if a word has any of the letters we want to exclude\n",
"\n",
"def check(s):\n",
"    if any(l in s for l in letters_exclude):\n",
"        return False\n",
"    else:\n",
"        return True"
]
},
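{
"cell_type": "markdown",
"metadata": {},
"source": [
"For what it's worth, the same check can also be written as a one-line set intersection. A sketch of that equivalent form:"
]
},

```python
EXCLUDE = set("bdfgjklmpquvwxyz")

def check(s):
    # True when the word shares no letters with the excluded set
    return not (set(s) & EXCLUDE)

print(check("state"))  # True: only common letters
print(check("word"))   # False: 'w' and 'd' are excluded
```
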
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&nbsp;\n",
"\n",
"Now let's see if the function works by printing each word from the original list that meets the letter criteria:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the\n",
"to\n",
"a\n",
"in\n",
"is\n",
"on\n",
"that\n",
"this\n",
"i\n",
"it\n",
"not\n",
"or\n",
"are\n",
"at\n",
"as\n",
"an\n",
"can\n",
"has\n",
"search\n",
"one\n",
"other\n",
"no\n",
"site\n",
"he\n",
"their\n",
"there\n",
"see\n",
"so\n",
"his\n",
"here\n",
"c\n",
"e\n",
"s\n",
"these\n",
"its\n",
"than\n",
"state\n",
"into\n",
"n\n",
"re\n",
"her\n",
"t\n",
"then\n",
"each\n",
"she\n",
"r\n",
"set\n",
"center\n",
"store\n",
"those\n",
"car\n",
"states\n",
"area\n",
"o\n",
"case\n",
"care\n",
"three\n",
"h\n",
"access\n",
"north\n",
"art\n",
"since\n",
"rate\n",
"sites\n",
"non\n",
"teen\n",
"too\n",
"estate\n",
"note\n",
"action\n",
"start\n",
"series\n",
"air\n",
"hot\n",
"cost\n",
"test\n",
"cart\n",
"san\n",
"street\n",
"recent\n",
"stores\n",
"act\n",
"rates\n",
"create\n",
"east\n",
"ii\n",
"ca\n",
"oct\n",
"china\n",
"star\n",
"areas\n",
"rss\n",
"enter\n",
"share\n",
"net\n",
"co\n",
"notice\n",
"once\n",
"others\n",
"cars\n",
"short\n",
"arts\n",
"et\n",
"st\n",
"costs\n",
"either\n",
"centre\n",
"tech\n",
"en\n",
"heart\n",
"choose\n",
"error\n",
"sort\n",
"cases\n",
"none\n",
"chat\n",
"near\n",
"oh\n",
"shoes\n",
"notes\n",
"cash\n",
"seen\n",
"screen\n",
"soon\n",
"across\n",
"season\n",
"casino\n",
"cross\n",
"rather\n",
"career\n",
"teens\n",
"sat\n",
"nice\n",
"score\n",
"sent\n",
"choice\n",
"hi\n",
"artist\n",
"asian\n",
"inn\n",
"cnet\n",
"inc\n",
"cancer\n",
"reason\n",
"sea\n",
"anti\n",
"earth\n",
"hair\n",
"cities\n",
"tree\n",
"ie\n",
"horse\n",
"stars\n",
"est\n",
"son\n",
"iii\n",
"senior\n",
"entire\n",
"asia\n",
"int\n",
"rest\n",
"hit\n",
"sense\n",
"race\n",
"etc\n",
"core\n",
"sets\n",
"rent\n",
"host\n",
"ohio\n",
"sector\n",
"coast\n",
"hear\n",
"ten\n",
"hits\n",
"th\n",
"cat\n",
"na\n",
"chris\n",
"os\n",
"nation\n",
"sheet\n",
"resort\n",
"chance\n",
"stone\n",
"tests\n",
"root\n",
"ice\n",
"shot\n",
"nc\n",
"scott\n",
"sec\n",
"canon\n",
"chair\n",
"shirt\n",
"sc\n",
"heat\n",
"nor\n",
"santa\n",
"se\n",
"saint\n",
"rose\n",
"errors\n",
"ac\n",
"rich\n",
"ar\n",
"sa\n",
"hire\n",
"ones\n",
"corner\n",
"chain\n",
"reach\n",
"inch\n",
"chart\n",
"cc\n",
"shirts\n",
"senate\n",
"ct\n",
"icon\n",
"cast\n",
"stats\n",
"hr\n",
"iron\n",
"ne\n",
"ocean\n",
"train\n",
"con\n",
"nt\n",
"es\n",
"cent\n",
"secret\n",
"aa\n",
"assets\n",
"assist\n",
"rare\n",
"rise\n",
"static\n",
"scene\n",
"eat\n",
"seat\n",
"ann\n",
"soccer\n",
"ch\n",
"christ\n",
"inches\n",
"rs\n",
"shares\n",
"cisco\n",
"tea\n",
"trees\n",
"easier\n",
"src\n",
"nine\n",
"eric\n",
"ratio\n",
"rain\n",
"onto\n",
"tennis\n",
"stress\n",
"ss\n",
"irish\n",
"acc\n",
"charts\n",
"tn\n",
"noise\n",
"sister\n",
"ce\n",
"coach\n",
"hat\n",
"cheats\n",
"iran\n",
"costa\n",
"acts\n",
"cotton\n",
"starts\n",
"scores\n",
"nh\n",
"rear\n",
"ia\n",
"ha\n",
"ea\n",
"chosen\n",
"sarah\n",
"hate\n",
"rice\n",
"raise\n",
"iso\n",
"catch\n",
"sir\n",
"earn\n",
"const\n",
"insert\n",
"res\n",
"sit\n",
"char\n",
"shots\n",
"crisis\n",
"treat\n",
"cs\n",
"echo\n",
"sheets\n",
"teach\n",
"nasa\n",
"si\n",
"css\n",
"threat\n",
"anne\n",
"asset\n",
"scan\n",
"sci\n",
"sin\n",
"cr\n",
"ee\n",
"inner\n",
"tone\n",
"ethics\n",
"stereo\n",
"taste\n",
"cache\n",
"er\n",
"seats\n",
"era\n",
"honor\n",
"cheese\n",
"coins\n",
"horror\n",
"shoe\n",
"ethnic\n",
"ran\n",
"actor\n",
"sr\n",
"nr\n",
"horses\n",
"thin\n",
"harris\n",
"chairs\n",
"sierra\n",
"cats\n",
"tr\n",
"ron\n",
"hist\n",
"crash\n",
"inter\n",
"te\n",
"sean\n",
"tion\n",
"hence\n",
"ear\n",
"tie\n",
"ian\n",
"ra\n",
"rc\n",
"rico\n",
"cst\n",
"ceo\n",
"ec\n",
"ross\n",
"anna\n",
"throat\n",
"sri\n",
"toe\n",
"trans\n",
"acres\n",
"nec\n",
"ease\n",
"arena\n",
"ri\n",
"rt\n",
"sensor\n",
"thai\n",
"scenes\n",
"icons\n",
"roses\n",
"chest\n",
"shorts\n",
"ah\n",
"tones\n",
"hearts\n",
"ns\n",
"carter\n",
"sons\n",
"hrs\n",
"ta\n",
"shoot\n",
"assess\n",
"stones\n",
"roots\n",
"shore\n",
"ieee\n",
"ho\n",
"sh\n",
"ae\n",
"titans\n",
"herein\n",
"rio\n",
"hs\n",
"hero\n",
"ai\n",
"ot\n",
"arc\n",
"hosts\n",
"coat\n",
"rica\n",
"actors\n",
"ion\n",
"ic\n",
"terror\n",
"intro\n",
"ent\n",
"ts\n",
"aaron\n",
"trace\n",
"ncaa\n",
"intent\n",
"tt\n",
"tee\n",
"hats\n",
"sharon\n",
"rr\n",
"titten\n",
"ace\n",
"tons\n",
"honest\n",
"chi\n",
"chase\n",
"athens\n",
"seo\n",
"nissan\n",
"ins\n",
"norton\n",
"tc\n",
"corn\n",
"tin\n",
"heroes\n",
"ir\n",
"ties\n",
"rat\n",
"ranch\n",
"toner\n",
"nose\n",
"thesis\n",
"cents\n",
"ti\n",
"sees\n",
"aaa\n",
"oo\n",
"coin\n",
"arch\n",
"ni\n",
"thats\n",
"asin\n",
"reset\n",
"tri\n",
"nn\n",
"chains\n",
"noon\n",
"cheat\n",
"teeth\n",
"tan\n",
"races\n",
"hon\n",
"attach\n",
"chose\n",
"nascar\n",
"tears\n",
"oasis\n",
"ist\n",
"cnn\n",
"tire\n",
"strain\n",
"scsi\n",
"inns\n",
"ash\n",
"easter\n",
"ci\n",
"nano\n",
"retain\n",
"chaos\n",
"rats\n",
"anchor\n",
"stat\n",
"thee\n",
"rec\n",
"ciao\n",
"ton\n",
"hints\n",
"oe\n",
"techno\n",
"cant\n",
"trains\n",
"arise\n",
"irc\n",
"sara\n",
"chess\n",
"oscar\n",
"strict\n",
"cet\n",
"tries\n",
"acer\n",
"ons\n",
"reno\n",
"horn\n",
"tires\n",
"retro\n",
"ati\n",
"rna\n",
"scotia\n",
"eco\n",
"honors\n",
"arrest\n",
"ict\n",
"ht\n",
"rh\n",
"roster\n",
"ooo\n",
"nhs\n",
"ste\n",
"hart\n",
"trance\n",
"notion\n",
"oc\n",
"arctic\n",
"treo\n",
"cia\n",
"haiti\n",
"ears\n",
"neo\n",
"cons\n",
"sonic\n",
"cheers\n",
"nat\n",
"cn\n",
"trio\n",
"rn\n",
"ser\n",
"ascii\n",
"trash\n",
"tier\n",
"cite\n",
"hose\n",
"saints\n",
"str\n",
"tenant\n",
"tattoo\n",
"tar\n",
"soc\n",
"sheer\n",
"eh\n",
"cohen\n",
"sie\n",
"acre\n",
"chen\n",
"hc\n",
"rca\n",
"satin\n",
"chan\n",
"tent\n",
"nathan\n",
"cos\n",
"ro\n",
"hans\n",
"sans\n",
"irs\n",
"casio\n",
"ana\n",
"onion\n",
"sao\n",
"scenic\n",
"hh\n",
"annie\n",
"asn\n",
"acne\n",
"ant\n",
"eos\n",
"raises\n",
"heath\n",
"sn\n",
"issn\n",
"sas\n",
"accent\n",
"sorts\n",
"hint\n",
"ate\n",
"io\n",
"tract\n",
"shine\n",
"casa\n",
"rosa\n",
"hash\n",
"cas\n",
"cio\n",
"tions\n",
"isa\n",
"tear\n",
"ata\n",
"nest\n",
"nato\n",
"stan\n",
"tooth\n",
"ratios\n",
"hansen\n",
"crest\n",
"tahoe\n",
"heater\n",
"cannon\n",
"ntsc\n",
"sic\n",
"cir\n",
"isaac\n",
"seas\n",
"ira\n",
"sen\n",
"enters\n",
"soa\n",
"neon\n",
"notre\n",
"choir\n",
"cho\n",
"resist\n",
"theta\n"
]
}
],
"source": [
"for s in df[0]:  # for each word in the list...\n",
"    if check(s) and len(s) < 7:  # if it only contains common letters and is under seven letters long...\n",
"        print(s)  # print that word"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&nbsp;\n",
"\n",
"Quite the list we have here!\n",
"\n",
"**Now let's store the words that pass both checks (common letters only, six letters or fewer) in a list:**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"reading_list = [] # empty list variable"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"for s in df[0]:\n",
"    if check(s) and len(s) < 7:  # keep words that use only the ten chosen letters and are six letters or fewer\n",
"        reading_list.append(s)  # add the word to the list if it meets both criteria"
]
},
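{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same filter can be written as a single list comprehension. A sketch, using a small stand-in word list in place of `df[0]`:"
]
},

```python
letters_exclude = ['b', 'd', 'f', 'g', 'j', 'k', 'l', 'm', 'p', 'q', 'u', 'v', 'w', 'x', 'y', 'z']

def check(s):
    # True when the word contains none of the excluded letters
    return not any(l in s for l in letters_exclude)

words = ["the", "of", "and", "to", "a", "window"]  # stand-in for df[0]
reading_list = [s for s in words if check(s) and len(s) < 7]
print(reading_list)  # ['the', 'to', 'a']
```
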
{
"cell_type": "markdown",
"metadata": {},
"source": [
"&nbsp;\n",
"\n",
"**The final code blocks below will write the word list to a text file:**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"the', 'to', 'a', 'in', 'is', 'on', 'that', 'this', 'i', 'it', 'not', 'or', 'are', 'at', 'as', 'an', 'can', 'has', 'search', 'one', 'other', 'no', 'site', 'he', 'their', 'there', 'see', 'so', 'his', 'here', 'c', 'e', 's', 'these', 'its', 'than', 'state', 'into', 'n', 're', 'her', 't', 'then', 'each', 'she', 'r', 'set', 'center', 'store', 'those', 'car', 'states', 'area', 'o', 'case', 'care', 'three', 'h', 'access', 'north', 'art', 'since', 'rate', 'sites', 'non', 'teen', 'too', 'estate', 'note', 'action', 'start', 'series', 'air', 'hot', 'cost', 'test', 'cart', 'san', 'street', 'recent', 'stores', 'act', 'rates', 'create', 'east', 'ii', 'ca', 'oct', 'china', 'star', 'areas', 'rss', 'enter', 'share', 'net', 'co', 'notice', 'once', 'others', 'cars', 'short', 'arts', 'et', 'st', 'costs', 'either', 'centre', 'tech', 'en', 'heart', 'choose', 'error', 'sort', 'cases', 'none', 'chat', 'near', 'oh', 'shoes', 'notes', 'cash', 'seen', 'screen', 'soon', 'across', 'season', 'casino', 'cross', 'rather', 'career', 'teens', 'sat', 'nice', 'score', 'sent', 'choice', 'hi', 'artist', 'asian', 'inn', 'cnet', 'inc', 'cancer', 'reason', 'sea', 'anti', 'earth', 'hair', 'cities', 'tree', 'ie', 'horse', 'stars', 'est', 'son', 'iii', 'senior', 'entire', 'asia', 'int', 'rest', 'hit', 'sense', 'race', 'etc', 'core', 'sets', 'rent', 'host', 'ohio', 'sector', 'coast', 'hear', 'ten', 'hits', 'th', 'cat', 'na', 'chris', 'os', 'nation', 'sheet', 'resort', 'chance', 'stone', 'tests', 'root', 'ice', 'shot', 'nc', 'scott', 'sec', 'canon', 'chair', 'shirt', 'sc', 'heat', 'nor', 'santa', 'se', 'saint', 'rose', 'errors', 'ac', 'rich', 'ar', 'sa', 'hire', 'ones', 'corner', 'chain', 'reach', 'inch', 'chart', 'cc', 'shirts', 'senate', 'ct', 'icon', 'cast', 'stats', 'hr', 'iron', 'ne', 'ocean', 'train', 'con', 'nt', 'es', 'cent', 'secret', 'aa', 'assets', 'assist', 'rare', 'rise', 'static', 'scene', 'eat', 'seat', 'ann', 'soccer', 'ch', 'christ', 'inches', 'rs', 'shares', 'cisco', 'tea', 'trees', 
'easier', 'src', 'nine', 'eric', 'ratio', 'rain', 'onto', 'tennis', 'stress', 'ss', 'irish', 'acc', 'charts', 'tn', 'noise', 'sister', 'ce', 'coach', 'hat', 'cheats', 'iran', 'costa', 'acts', 'cotton', 'starts', 'scores', 'nh', 'rear', 'ia', 'ha', 'ea', 'chosen', 'sarah', 'hate', 'rice', 'raise', 'iso', 'catch', 'sir', 'earn', 'const', 'insert', 'res', 'sit', 'char', 'shots', 'crisis', 'treat', 'cs', 'echo', 'sheets', 'teach', 'nasa', 'si', 'css', 'threat', 'anne', 'asset', 'scan', 'sci', 'sin', 'cr', 'ee', 'inner', 'tone', 'ethics', 'stereo', 'taste', 'cache', 'er', 'seats', 'era', 'honor', 'cheese', 'coins', 'horror', 'shoe', 'ethnic', 'ran', 'actor', 'sr', 'nr', 'horses', 'thin', 'harris', 'chairs', 'sierra', 'cats', 'tr', 'ron', 'hist', 'crash', 'inter', 'te', 'sean', 'tion', 'hence', 'ear', 'tie', 'ian', 'ra', 'rc', 'rico', 'cst', 'ceo', 'ec', 'ross', 'anna', 'throat', 'sri', 'toe', 'trans', 'acres', 'nec', 'ease', 'arena', 'ri', 'rt', 'sensor', 'thai', 'scenes', 'icons', 'roses', 'chest', 'shorts', 'ah', 'tones', 'hearts', 'ns', 'carter', 'sons', 'hrs', 'ta', 'shoot', 'assess', 'stones', 'roots', 'shore', 'ieee', 'ho', 'sh', 'ae', 'titans', 'herein', 'rio', 'hs', 'hero', 'ai', 'ot', 'arc', 'hosts', 'coat', 'rica', 'actors', 'ion', 'ic', 'terror', 'intro', 'ent', 'ts', 'aaron', 'trace', 'ncaa', 'intent', 'tt', 'tee', 'hats', 'sharon', 'rr', 'titten', 'ace', 'tons', 'honest', 'chi', 'chase', 'athens', 'seo', 'nissan', 'ins', 'norton', 'tc', 'corn', 'tin', 'heroes', 'ir', 'ties', 'rat', 'ranch', 'toner', 'nose', 'thesis', 'cents', 'ti', 'sees', 'aaa', 'oo', 'coin', 'arch', 'ni', 'thats', 'asin', 'reset', 'tri', 'nn', 'chains', 'noon', 'cheat', 'teeth', 'tan', 'races', 'hon', 'attach', 'chose', 'nascar', 'tears', 'oasis', 'ist', 'cnn', 'tire', 'strain', 'scsi', 'inns', 'ash', 'easter', 'ci', 'nano', 'retain', 'chaos', 'rats', 'anchor', 'stat', 'thee', 'rec', 'ciao', 'ton', 'hints', 'oe', 'techno', 'cant', 'trains', 'arise', 'irc', 'sara', 'chess', 'oscar', 
'strict', 'cet', 'tries', 'acer', 'ons', 'reno', 'horn', 'tires', 'retro', 'ati', 'rna', 'scotia', 'eco', 'honors', 'arrest', 'ict', 'ht', 'rh', 'roster', 'ooo', 'nhs', 'ste', 'hart', 'trance', 'notion', 'oc', 'arctic', 'treo', 'cia', 'haiti', 'ears', 'neo', 'cons', 'sonic', 'cheers', 'nat', 'cn', 'trio', 'rn', 'ser', 'ascii', 'trash', 'tier', 'cite', 'hose', 'saints', 'str', 'tenant', 'tattoo', 'tar', 'soc', 'sheer', 'eh', 'cohen', 'sie', 'acre', 'chen', 'hc', 'rca', 'satin', 'chan', 'tent', 'nathan', 'cos', 'ro', 'hans', 'sans', 'irs', 'casio', 'ana', 'onion', 'sao', 'scenic', 'hh', 'annie', 'asn', 'acne', 'ant', 'eos', 'raises', 'heath', 'sn', 'issn', 'sas', 'accent', 'sorts', 'hint', 'ate', 'io', 'tract', 'shine', 'casa', 'rosa', 'hash', 'cas', 'cio', 'tions', 'isa', 'tear', 'ata', 'nest', 'nato', 'stan', 'tooth', 'ratios', 'hansen', 'crest', 'tahoe', 'heater', 'cannon', 'ntsc', 'sic', 'cir', 'isaac', 'seas', 'ira', 'sen', 'enters', 'soa', 'neon', 'notre', 'choir', 'cho', 'resist', 'theta\""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reading_list = str(reading_list)  # convert the list to its string representation\n",
"reading_list = reading_list[2:-2]  # trim the leading \"['\" and trailing \"']\"\n",
"reading_list"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"reading_list = reading_list.replace(\"', '\", \"\\n\")  # put each word on its own line"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"output = reading_list\n",
"file = open(\"reading_list.txt\",\"w\")\n",
"file.write(str(output))\n",
"file.close()"
]
},
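{
"cell_type": "markdown",
"metadata": {},
"source": [
"The round trip through `str()` and `replace()` above works, but joining the original list directly avoids the string surgery. A sketch, with a stand-in list and a hypothetical filename:"
]
},

```python
reading_list = ["the", "to", "a", "in"]  # stand-in for the filtered words

# write one word per line; the with-block closes the file automatically
with open("reading_list_v2.txt", "w") as f:
    f.write("\n".join(reading_list))
```
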
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Final Thoughts\n",
"\n",
"Overall, this is a simple thing to pull off. The issues I have with the resulting output so far break down into the following:\n",
"\n",
"* Some of the listed words are just letter combinations that don't sound out as any word ('c', 'e', and 's' for instance)\n",
"* Sorting words by long vs. short vowel sounds might be helpful (\"state\" vs. \"stat\"); more code could pull this off\n",
"* Some letter combos have very odd pronunciation rules, like the \"tio\" in \"ratio\" or \"nation\"\n",
"\n",
"The next steps from here are generating short 3-5 word sentences that only use words from the resulting list, and deciding how and when to introduce new letters and corresponding words. Even if there were only 25 useful nouns, verbs, and adjectives on the list (75 total), that would still mean roughly 15,000 simple sentences are possible from this list!\n",
"\n",
"Existing Python NLP libraries could probably pull this off. Stay tuned..."
]
},
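{
"cell_type": "markdown",
"metadata": {},
"source": [
"The combinatorial claim above can be sanity-checked: with 25 words in each of three slots, a fixed sentence template yields 25 × 25 × 25 = 15,625 combinations. A toy sketch with three hypothetical sample words per slot:"
]
},

```python
from itertools import product

# tiny stand-in word sets drawn from the kind of words on the list
adjectives = ["nice", "hot", "rare"]
nouns = ["cat", "car", "star"]
verbs = ["is", "has", "sees"]

# one fixed template; every slot combination gives a distinct sentence
sentences = [f"the {a} {n} {v} it" for a, n, v in product(adjectives, nouns, verbs)]
print(len(sentences))  # 27 = 3 * 3 * 3
print(sentences[0])    # the nice cat is it
```

With 25 words per slot instead of three, the same product gives 25 ** 3 = 15,625 sentences, matching the rough figure above.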
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}