Skip to content

Instantly share code, notes, and snippets.

@sandinmyjoints
Last active June 3, 2018 06:09
Show Gist options
  • Save sandinmyjoints/25bcbc2b4099fb9816564a1d55eb1480 to your computer and use it in GitHub Desktop.
Save sandinmyjoints/25bcbc2b4099fb9816564a1d55eb1480 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# The missing translations problem\n",
"\n",
"When a user thinks of a word in the language they know, and they want to \"learn\"\n",
"that word in the language they are learning, what can we offer them to add to\n",
"their list?\n",
"\n",
"Manifestations:\n",
"\n",
"### As search result\n",
"- I know the word yellow in English.\n",
"- I want to learn it in Spanish.\n",
"- So I go to http://www.spanishdict.com/translate/yellow\n",
"- I see a button \"Add word to list\" and I click it.\n",
"\n",
"### As vocab list search\n",
"- I know the word yellow in English.\n",
"- I want to learn it in Spanish.\n",
"- So I go to a learningLang Spanish list I own: https://www.spanishdict.com/lists/342/williams-list\n",
"- I click Add new word\n",
"- I type \"yellow\" into the search box.\n",
"\n",
"How bad is the missing translations problem?\n",
"\n",
"- It is somewhat hidden when adding words to a list using autosearch, because\n",
" unless you have a results page open to compare, you don't know what's not\n",
" being included in the suggestions.\n",
"\n",
"- It might be more apparent using Add word to list button on results pages,\n",
" because users will be able to see that either there is nothing available to\n",
" add to their list, or if some options are available, they might not include\n",
" some of the translations that they are looking at right there on the result\n",
" page.\n",
"\n",
"- The queries below will try to quantify the number and proportion of entries on\n",
" which this problem exists."
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"First, connect to your local database. (Now's a good time to pull a recent\n",
"database dump if yours is old, then come back and reload this notebook.)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"execute-time": [
"2018-05-18T22:18:52.680102Z",
"2018-05-18T22:18:52.751290Z"
],
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'Connected: sd@sd_prod'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%reload_ext sql\n",
"from dotenv import load_dotenv\n",
"from pathlib import Path\n",
"env_path = Path('.') / '.env'\n",
"load_dotenv(dotenv_path=env_path)\n",
"\n",
"%reload_ext sql\n",
"import os\n",
"user = os.getenv('SD_USER') or 'sd'\n",
"password = os.getenv('SD_PASSWORD') or 'sd_password'\n",
"host = os.getenv('SD_HOST') or '127.0.0.1'\n",
"port = os.getenv('SD_PORT') or '3311'\n",
"connection_string = \"mysql+mysqlconnector://{user}:{password}@{host}:{port}/sd_prod?charset=utf8\".format(user=user, password=password, host=host, port=port)\n",
"%sql $connection_string\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"So that we don't swamp this notebook with ten of thousands of results, we'll\n",
"turn on some automatic limits. Feel free to adjust these, or copy the SQL\n",
"directly into the SQL client of your choice and run without limits to see the\n",
"entire result set. Or add appropriate `where source = 'X'` clauses to look up\n",
"specific words to see how they do."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"execute-time": [
"2018-05-18T22:18:55.801100Z",
"2018-05-18T22:18:55.829983Z"
],
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"%config SqlMagic.autolimit = 20\n",
"%config SqlMagic.displaylimit = 20\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"First question: how many published English Neodict entries (aka headwords) are\n",
"there?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"execute-time": [
"2018-05-18T22:18:57.935598Z",
"2018-05-18T22:18:58.065685Z"
],
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/plain": [
"[(50644,)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql select count(*) as count from dictionary_en where dictionary_en.neodict_status = 'P';"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Next question: how many of these entries don't have even one quickdef that is a\n",
"Spanish Neodict headword?\n",
"\n",
"Some entries have only quickdefs that are not themselves published Neodict\n",
"entries: they might not be entries at all, or they might be non-Neodict entries.\n",
"In both those cases, we have nothing to offer the user who wants to learn\n",
"translation of the original headword into their learning lang.\n",
"\n",
"So for each headword, find the number of translations of opposite lang headwords\n",
"for its quickdefs, and filter to only the ones where that number is zero."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"25012 rows affected.\n",
"Returning data to local variable result\n"
]
}
],
"source": [
"%%sql result <<\n",
"select en_side.source as en_headword,\n",
"count(es_side.translation) as total_num_translations_of_its_quickdefs\n",
"from dictionary_en as en_side\n",
"inner join dictionarysd_en_sense en_senses on en_senses.word_id = en_side.id\n",
"inner join dictionarysd_en_translation en_translations on en_translations.sense_id = en_senses.id\n",
"left join (\n",
" select d.source, t.translation from dictionary_es d\n",
" inner join dictionarysd_es_sense s on s.word_id = d.id\n",
" inner join dictionarysd_es_translation t on t.sense_id = s.id\n",
") as es_side on es_side.source = en_translations.translation\n",
"where en_side.neodict_status in ('P')\n",
"and en_translations.is_quick_translation = 1\n",
"group by en_headword\n",
"having total_num_translations_of_its_quickdefs = 0\n",
";\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------------------------------------+-----------------------------------------+\n",
"|en_headword | total_num_translations_of_its_quickdefs |\n",
"+--------------------------------------------------+-----------------------------------------+\n",
"|\"How the Grinch Stole Christmas by Dr. seuss' | 0 |\n",
"|\"The Adventures of Tom Sawyer by Mark Twain' | 0 |\n",
"|'til death do us part | 0 |\n",
"|'tis the season | 0 |\n",
"|-able | 0 |\n",
"|-aided | 0 |\n",
"|-ar | 0 |\n",
"|-ar ending | 0 |\n",
"|-atic | 0 |\n",
"|-ation | 0 |\n",
"|-atory | 0 |\n",
"|-cooled | 0 |\n",
"|-dimensional | 0 |\n",
"|-er | 0 |\n",
"|-er verbs | 0 |\n",
"|-fic | 0 |\n",
"|-handed | 0 |\n",
"|-handled | 0 |\n",
"|-hearted | 0 |\n",
"|-ic | 0 |\n",
"+--------------------------------------------------+-----------------------------------------+\n"
]
}
],
"source": [
"def format_and_print_result_table(result_table):\n",
" result_table.pretty.align = 'l'\n",
" rendered = result_table.__str__()\n",
" import re\n",
" RE_BYTE_STRINGS = re.compile(r\" b['\\\"](.+?)['\\\"] \")\n",
" # Some columns come back as byte strings because they use collate utf8_bin.\n",
" # This re catches them and prints improves their string repr. TODO: Decode the\n",
" # captured utf-8 byte strings to strings.\n",
" formatted_output = re.sub(RE_BYTE_STRINGS, '\\g<1> ', rendered)\n",
" print(formatted_output)\n",
"\n",
"format_and_print_result_table(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"These are some strange looking headwords. That's because they are ordered\n",
"lexicographically, so punctuation shows up early, and we are limiting the\n",
"results, so we're just seeing the ones that sort near the top. Let's add an\n",
"offset that jumps us right into the middle of the result set. We'll also pretty\n",
"print the results."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"%config SqlMagic.autolimit = None\n",
"%config SqlMagic.displaylimit = 50"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"execute-time": [
"2018-05-18T22:19:07.925588Z",
"2018-05-18T22:19:18.604786Z"
],
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"50 rows affected.\n",
"Returning data to local variable result\n"
]
}
],
"source": [
"%%sql result <<\n",
"select en_side.source as en_headword,\n",
"count(es_side.translation) as total_num_translations_of_its_quickdefs\n",
"from dictionary_en as en_side\n",
"inner join dictionarysd_en_sense en_senses on en_senses.word_id = en_side.id\n",
"inner join dictionarysd_en_translation en_translations on en_translations.sense_id = en_senses.id\n",
"left join (\n",
" select d.source, t.translation from dictionary_es d\n",
" inner join dictionarysd_es_sense s on s.word_id = d.id\n",
" inner join dictionarysd_es_translation t on t.sense_id = s.id\n",
") as es_side on es_side.source = en_translations.translation\n",
"where en_side.neodict_status in ('P')\n",
"and en_translations.is_quick_translation = 1\n",
"group by en_headword\n",
"having total_num_translations_of_its_quickdefs = 0\n",
"limit 50 offset 13700\n",
";\n"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"execute-time": [
"2018-05-18T22:19:18.614318Z",
"2018-05-18T22:19:18.636541Z"
],
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------------+-----------------------------------------+\n",
"|en_headword | total_num_translations_of_its_quickdefs |\n",
"+-----------------------+-----------------------------------------+\n",
"|interrelated | 0 |\n",
"|intersectionality | 0 |\n",
"|interstate | 0 |\n",
"|interstellar | 0 |\n",
"|intimidating | 0 |\n",
"|into the woods | 0 |\n",
"|intramural | 0 |\n",
"|intramuscular | 0 |\n",
"|intrinsic | 0 |\n",
"|intrinsically | 0 |\n",
"|introduce myself | 0 |\n",
"|introspection | 0 |\n",
"|invaluable | 0 |\n",
"|invasive | 0 |\n",
"|invasive species | 0 |\n",
"|inventive | 0 |\n",
"|invertebrate | 0 |\n",
"|invigorating | 0 |\n",
"|inviolable | 0 |\n",
"|invisibility | 0 |\n",
"|invite her | 0 |\n",
"|inviter | 0 |\n",
"|invocation timeout | 0 |\n",
"|invoice date | 0 |\n",
"|involuntary | 0 |\n",
"|involved in | 0 |\n",
"|irascible | 0 |\n",
"|iridescent | 0 |\n",
"|iron deficiency | 0 |\n",
"|iron fist | 0 |\n",
"|iron ore | 0 |\n",
"|ironically | 0 |\n",
"|ironing board | 0 |\n",
"|irrefutable | 0 |\n",
"|irregular verbs | 0 |\n",
"|irregulars | 0 |\n",
"|irrelevant | 0 |\n",
"|irremediable | 0 |\n",
"|irreplaceable | 0 |\n",
"|irresistible | 0 |\n",
"|irrespective | 0 |\n",
"|irritability | 0 |\n",
"|irritation | 0 |\n",
"|is considered | 0 |\n",
"|is in love | 0 |\n",
"|is it cold today | 0 |\n",
"|is it true | 0 |\n",
"|is my favorite | 0 |\n",
"|is older than | 0 |\n",
"|is she | 0 |\n",
"+-----------------------+-----------------------------------------+\n"
]
}
],
"source": [
"format_and_print_result_table(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Some of these results are phrases, but a number of them are plain words. -\n",
"interstellar. On http://www.spanishdict.com/translate/interstellar, we tell you\n",
"its quickdef is \"interestelar, intersideral\". The first one is a Collins entry,\n",
"and the second is not found. Neither one has a Neodict entry. So if an English\n",
"speaker learning Spanish plugs \"interstellar\" into their vocab list autosearch,\n",
"wanting to learn the Spanish translation of it, they will see \"no results found\"\n",
"(try it), nothing to add to their list. Same thing if they type \"interestelar\"\n",
"directly.\n",
"\n",
"- From this set of results, other words that are like this include interrelated,\n",
" intimidating, intramural, intrinsic, invaluable, invisibility.\n",
"\n",
"- irresistible is a different case. On\n",
"http://www.spanishdict.com/translate/irresistible, we tell you its quickdef is\n",
"\"irresistable\" (es) -- but there is no Spanish Neodict entry for \"irresistible\".\n",
"What happens if you try to add irrestible to a vocab list? \"arrollador\" is\n",
"suggested. That's because irresistible is a translation of the third sense of\n",
"arrollador -- but arrollador doesn't appear anywhere on the results page for\n",
"irresistible. Due to asymmetry between the sides of the dictionary, some\n",
"words may still have translations to offer even when these queries don't find\n",
"them.\n",
"\n",
"- From this set of results, other words that are like this include irrelevant, ...\n",
"\n",
"And there's nothing special about this particular group of results. Spot\n",
"checking other ranges yields roughly similar numbers of words like these."
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"What about the ~25K EN headwords that did have quickdefs that are ES entries\n",
"that have translations?\n",
"\n",
"If any of these translations are equal to the headword, then we have something\n",
"to offer the user who wants to add it to their learning lang list.\n",
"\n",
"For each headword, sum the number of translations are equal to the headword\n",
"(account for \"to \" + headword which is the way English verbs are authored as\n",
"tranlsations of Spanish words). List the rows that have zero of them."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"7066 rows affected.\n",
"Returning data to local variable result\n"
]
}
],
"source": [
"%%sql result <<\n",
"select en_words_with_translations.id, en_headword, en_headword_to, es_side.source, es_side.translation,\n",
"sum(case when (en_headword = es_side.translation or en_headword_to = es_side.translation) then 1 else 0 end) as num_matches\n",
" from\n",
"(\n",
" select en_side.id, en_side.source as en_headword, concat(\"to \", en_side.source) as en_headword_to,\n",
" count(es_side.translation) as total_num_translations_of_its_quickdefs\n",
" from dictionary_en as en_side\n",
" inner join dictionarysd_en_sense en_senses on en_senses.word_id = en_side.id\n",
" inner join dictionarysd_en_translation en_translations on en_translations.sense_id = en_senses.id\n",
" inner join (\n",
" select d.source, t.translation from dictionary_es d\n",
" inner join dictionarysd_es_sense s on s.word_id = d.id\n",
" inner join dictionarysd_es_translation t on t.sense_id = s.id\n",
" ) as es_side on es_side.source = en_translations.translation\n",
" where en_side.neodict_status in ('P')\n",
" and en_translations.is_quick_translation = 1\n",
" group by en_headword\n",
" having total_num_translations_of_its_quickdefs > 0)\n",
"as en_words_with_translations\n",
"inner join dictionarysd_en_sense en_senses on en_senses.word_id = en_words_with_translations.id\n",
"inner join dictionarysd_en_translation en_translations on en_translations.sense_id = en_senses.id\n",
"left join\n",
" (select d.source,\n",
" t.translation\n",
" from dictionary_es d\n",
" inner join dictionarysd_es_sense s on s.word_id = d.id\n",
" inner join dictionarysd_es_translation t on t.sense_id = s.id)\n",
"as es_side on es_side.source = en_translations.translation\n",
"group by en_headword\n",
"having num_matches = 0\n",
";\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------+-----------------------------------+--------------------------------------+-------------------------------------------+---------------------------------------+-------------+\n",
"| id |en_headword |en_headword_to |source | translation | num_matches |\n",
"+--------+-----------------------------------+--------------------------------------+-------------------------------------------+---------------------------------------+-------------+\n",
"| 156372 |'til death |to 'til death |hasta la muerte | to death | 0 |\n",
"| 129103 |'tween |to 'tween |entre | between | 0 |\n",
"| 137 |-ably |to -ably |-mente | -ly | 0 |\n",
"| 118998 |-ard |to -ard |-acho | | 0 |\n",
"| 131796 |-hugging |to -hugging |ajustado | tight | 0 |\n",
"| 123917 |-less |to -less |sin | without | 0 |\n",
"| 56405 |-sided |to -sided |de lado | sideways | 0 |\n",
"| 127705 |-skinned |to -skinned |de piel | fur | 0 |\n",
"| 131772 |-wright |to -wright |-ero | -er | 0 |\n",
"| 138126 |5 years ago |to 5 years ago |hace cinco a\\xc3\\xb1os | five years ago | 0 |\n",
"| 118756 |AK |to AK |Alaska | Alaska | 0 |\n",
"| 3248 |ARA |to ARA |AA | AA | 0 |\n",
"| 144400 |Academy Award |to Academy Award |\\xc3\\x93scar | Oscar | 0 |\n",
"| 1688 |Allies |to Allies |Aliados | the Allies | 0 |\n",
"| 159238 |Almighty |to Almighty |Todopoderoso | the Almighty | 0 |\n",
"| 133639 |Anna |to Anna |Ana | Ann | 0 |\n",
"| 159249 |Apocalypse |to Apocalypse |Apocalipsis | Revelations | 0 |\n",
"| 146642 |Are you alright? |to Are you alright? |\\xc2\\xbfEst\\xc3\\xa1s bien? | Are you OK? | 0 |\n",
"| 156347 |Are you fine? |to Are you fine? |\\xc2\\xbfEst\\xc3\\xa1s bien? | Are you OK? | 0 |\n",
"| 153873 |Are you good? |to Are you good? |\\xc2\\xbfEst\\xc3\\xa1s bien? | Are you OK? | 0 |\n",
"| 137180 |Are you kidding me? |to Are you kidding me? |\\xc2\\xbfQu\\xc3\\xa9 dices? | What are you saying? | 0 |\n",
"| 130397 |Are you ok? |to Are you ok? |\\xc2\\xbfEst\\xc3\\xa1s bien? | Are you OK? | 0 |\n",
"| 130895 |Are you okay? |to Are you okay? |\\xc2\\xbfEst\\xc3\\xa1s bien? | Are you OK? | 0 |\n",
"| 155317 |Are you well? |to Are you well? |\\xc2\\xbfEst\\xc3\\xa1s bien? | Are you OK? | 0 |\n",
"| 50847 |Armageddon |to Armageddon |Apocalipsis | Revelations | 0 |\n",
"| 153417 |At what time do you go to bed? |to At what time do you go to bed? |\\xc2\\xbfA qu\\xc3\\xa9 hora te acuestas? | What time do you go to bed? | 0 |\n",
"| 50924 |BA |to BA |licenciatura | degree | 0 |\n",
"| 51029 |BBQ |to BBQ |barbacoa | barbecued meat | 0 |\n",
"| 119400 |BF |to BF |mejor amigo | best friend | 0 |\n",
"| 150102 |BFF |to BFF |mejor amigo | best friend | 0 |\n",
"| 119762 |BTW |to BTW |por cierto | by the way | 0 |\n",
"| 119265 |Bart |to Bart | None | None | 0 |\n",
"| 119581 |Boche |to Boche | None | None | 0 |\n",
"| 7100 |Briton |to Briton |brit\\xc3\\xa1nico | British | 0 |\n",
"| 119721 |Bro. |to Bro. |hermano | brother | 0 |\n",
"| 7682 |CA |to CA |California | California | 0 |\n",
"| 120632 |CP |to CP |buj\\xc3\\xada | spark plug | 0 |\n",
"| 140801 |CT scan |to CT scan |tomograf\\xc3\\xada | tomography | 0 |\n",
"| 162752 |Caitlin |to Caitlin |Catalina | Catherine | 0 |\n",
"| 150460 |Can I go to the restroom? |to Can I go to the restroom? |\\xc2\\xbfPuedo ir al ba\\xc3\\xb1o? | Can I go to the bathroom? | 0 |\n",
"| 135782 |Can I have the bill, please? |to Can I have the bill, please? |\\xc2\\xbfMe trae la cuenta, por favor? | Can I get the bill, please? | 0 |\n",
"| 135560 |Can I use the bathroom? |to Can I use the bathroom? |\\xc2\\xbfPuedo ir al ba\\xc3\\xb1o? | Can I go to the bathroom? | 0 |\n",
"| 134575 |Can we have the bill, please? |to Can we have the bill, please? |\\xc2\\xbfNos trae la cuenta, por favor? | Could you bring us the check, please? | 0 |\n",
"| 8324 |Carl |to Carl |Carlos | Charles | 0 |\n",
"| 144682 |Christmas time |to Christmas time |Navidad | Christmas | 0 |\n",
"| 159339 |Chuck |to Chuck |Carlitos | Charlie | 0 |\n",
"| 150260 |Cinderella story |to Cinderella story |cuento de hadas | fairy tale | 0 |\n",
"| 10664 |Columbian |to Columbian |colombino | of Columbus | 0 |\n",
"| 11490 |Congo |to Congo |Congo | the Congo | 0 |\n",
"| 159394 |Daisy |to Daisy |Margarita | Margaret | 0 |\n",
"+--------+-----------------------------------+--------------------------------------+-------------------------------------------+---------------------------------------+-------------+\n"
]
}
],
"source": [
"format_and_print_result_table(result)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Examples: bedside, behold"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"So out of 50644 published English Neodict entries, 25012 had no quickdefs that\n",
"are Spanish headwords that have translations, and 7065 had quickdefs that are\n",
"Spanish headwords that have translations but none of the translations were the\n",
"original English headword. What percent is that?"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"63.3"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(((25012 + 7065) / 50644.0) * 100, 1)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"You can try these out currently by typing these words in vocab list autosearch:\n",
"they will not have results. I think this will become more apparent when we start\n",
"to place the \"Add word to list\" button on results pages.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"And this is not accounting for missing translations among headwords that do appear as\n",
"translations for at least some of their quickdefs.\n",
"\n",
"For example, take the word \"movies\". We give five translations into Spanish:\n",
"\n",
"![5 translations](https://cldup.com/w-wSHJTWJc.png)\n",
"\n",
"But only offer one of them to add to your vocab list:\n",
"\n",
"![only 1](https://cldup.com/e0kTxfdxr4.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Let's do this for Spanish. How many published Spanish Neodict entries (aka headwords) are\n",
"there?\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n",
"1 rows affected.\n"
]
},
{
"data": {
"text/plain": [
"[(42061,)]"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql select count(*) as count from dictionary_es where dictionary_es.neodict_status = 'P';\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"Next question: how many of these entries don't have even one quickdef that is an\n",
"English Neodict headword?\n"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"20545 rows affected.\n",
"Returning data to local variable result\n"
]
}
],
"source": [
"%%sql result <<\n",
"select es_side.source as es_headword,\n",
"count(en_side.translation) as total_num_translations_of_its_quickdefs\n",
"from dictionary_es as es_side\n",
"inner join dictionarysd_es_sense es_senses on es_senses.word_id = es_side.id\n",
"inner join dictionarysd_es_translation es_translations on es_translations.sense_id = es_senses.id\n",
"left join (\n",
" select d.source, t.translation from dictionary_en d\n",
" inner join dictionarysd_en_sense s on s.word_id = d.id\n",
" inner join dictionarysd_en_translation t on t.sense_id = s.id\n",
") as en_side on en_side.source = es_translations.translation\n",
"where es_side.neodict_status in ('P')\n",
"and es_translations.is_quick_translation = 1\n",
"group by es_headword\n",
"having total_num_translations_of_its_quickdefs = 0\n",
";\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"OK, what about the ~22K ES headwords that *did* have quickdefs that are EN\n",
"entries that have translations? How many of those EN entries have translations\n",
"that are equal to the headword, so we have something to offer the user who wants to\n",
"add it to their learning lang list?\n"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" * mysql+mysqlconnector://sd:***@127.0.0.1:3311/sd_prod?charset=utf8\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"5440 rows affected.\n",
"Returning data to local variable result\n"
]
}
],
"source": [
"%%sql result <<\n",
"select es_words_with_translations.id, es_headword, en_side.source, en_side.translation,\n",
"sum(case when (es_headword = en_side.translation) then 1 else 0 end) as num_matches\n",
" from\n",
"(\n",
" select es_side.id, es_side.source as es_headword,\n",
" count(en_side.translation) as total_num_translations_of_its_quickdefs\n",
" from dictionary_es as es_side\n",
" inner join dictionarysd_es_sense es_senses on es_senses.word_id = es_side.id\n",
" inner join dictionarysd_es_translation es_translations on es_translations.sense_id = es_senses.id\n",
" inner join (\n",
" select d.source, t.translation from dictionary_en d\n",
" inner join dictionarysd_en_sense s on s.word_id = d.id\n",
" inner join dictionarysd_en_translation t on t.sense_id = s.id\n",
" ) as en_side on en_side.source = es_translations.translation\n",
" where es_side.neodict_status in ('P')\n",
" and es_translations.is_quick_translation = 1\n",
" group by es_headword\n",
" having total_num_translations_of_its_quickdefs > 0)\n",
"as es_words_with_translations\n",
"inner join dictionarysd_es_sense es_senses on es_senses.word_id = es_words_with_translations.id\n",
"inner join dictionarysd_es_translation es_translations on es_translations.sense_id = es_senses.id\n",
"left join\n",
" (select d.source,\n",
" t.translation\n",
" from dictionary_en d\n",
" inner join dictionarysd_en_sense s on s.word_id = d.id\n",
" inner join dictionarysd_en_translation t on t.sense_id = s.id)\n",
"as en_side on en_side.source = es_translations.translation\n",
"group by es_headword\n",
"having num_matches = 0\n",
";\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"So out of 42061 published Spanish Neodict entries, 20545 had no quickdefs that\n",
"are English headwords that have translations, and 5440 had quickdefs that are\n",
"English headwords that have translations but none of the translations were the\n",
"original Spanish headword. What percent is that?\n"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"autoscroll": false,
"collapsed": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"61.8"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round((20545 + 5440) / 42061 * 100, 1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"So we've seen that in both English and Spanish, in a substantial number of\n",
"cases, a user will see a headword-translation on our site that that they might\n",
"want to add to a vocab list, but be unable to do so, for the reasons lined above.\n",
"\n",
"The numbers aren't quite as bad as this because as we've seen, some of the words\n",
"coming up in these queries do appear as translations of other, different words\n",
"in the opposite language that are not the quickdef of the original word. And\n",
"plenty of these \"words\" are really phrase, many of them being long-tail queries\n",
"that few users would ever try to add to a list. Still, the problem remains that\n",
"many users may be frustrated when they attempt to learn a word and are unable to\n",
"do so, **particularly** when they might be able to see the translaton listed on\n",
"our own results page right in fronf of them!\n",
"\n",
"### Possible next steps\n",
"\n",
"- For each Neodict entry in a lang, search for it among the translations of\n",
"opposite lang headwords, to see how many never show up.\n",
"-"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"name": "missing_translations1.ipynb"
},
"nbformat": 4,
"nbformat_minor": 2
}
jupyter==1.0.0
ipython-sql==0.3.9
mysql-connector-python==8.0.11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment