Noleli/vowel search.ipynb

## vowel search.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lawrence Szenes-Strauss posted the following question on [Facebook](https://www.facebook.com/groups/1071696109619922/permalink/1558249010964627/):\n",
    "\n",
    "> Who can come up with a short passage of Tanakh that:\n",
    "> 1. Contains all 12 distinct Masoretic vowel marks (qamats, patah, hataf patah, tsere, segol, hataf segol, hiriq, holam, shuruq, qubuts, hataf qamats, sheva) and\n",
    "> 2. Does not contain a shem kodesh.\n",
    "> \n",
    "> Looking to use it in an initial reading assessment for some students. Thanks!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "datadir = '../'\n",
    "vowelnames = {'qamats': u'\\u05B8', 'patah': u'\\u05B7', 'hataf patah': u'\\u05B2', 'tsere': u'\\u05B5', 'segol': u'\\u05B6', 'hataf segol': u'\\u05B1', 'hiriq': u'\\u05B4', 'holam': u'\\u05B9', 'shuruq': 'וּ', 'qubuts': u'\\u05BB', 'hataf qamats': u'\\u05B3', 'sheva': u'\\u05B0'}\n",
    "letters = 'אבגדהוזחטיכךלמםנןסעפףצץקרשת '"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sfarim = ['bereshit', 'shmot', 'vayikra', 'bmidbar', 'dvarim']\n",
    "\n",
    "data = {} # so you can inspect manually if you want\n",
    "results = []\n",
    "\n",
    "for sefer in sfarim:\n",
    "    data[sefer] = {}\n",
    "    tree = ET.parse(datadir + sefer + '.xml')\n",
    "    root = tree.getroot() \n",
    "    prakim = root.findall('.//c')\n",
    "    for perek in prakim:\n",
    "        pereknum = int(perek.attrib['n'])\n",
    "        if pereknum not in data[sefer]: data[sefer][pereknum] = {}\n",
    "        psukim = perek.findall('v')\n",
    "        for pasuk in psukim:\n",
    "            pasuknum = int(pasuk.attrib['n'])\n",
    "            if pasuknum not in data[sefer][pereknum]:\n",
    "                data[sefer][pereknum][pasuknum] = {}\n",
    "            text = [w.text for w in pasuk if w.tag=='w' or w.tag=='q']\n",
    "            words = [''.join(list(filter(lambda c: c in letters, w))) for w in text]\n",
    "            vowels = re.findall(r'|'.join(vowelnames.values()), ' '.join(text)) # because shuruq is actually 2 chars\n",
    "            data[sefer][pereknum][pasuknum]['text'] = text\n",
    "            data[sefer][pereknum][pasuknum]['words'] = words\n",
    "            data[sefer][pereknum][pasuknum]['vowels'] = vowels\n",
    "\n",
    "            if all([v in vowels for v in vowelnames.values()]): results.append((sefer, pereknum, pasuknum))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('shmot', 32, 6), ('vayikra', 22, 3)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Turns out there are only two results in the Torah, so I'm not bothering to filter for *shem kodesh*.\n",
    "\n",
    "Manual inspection shows that **[Shmot 32:6](https://www.sefaria.org/Exodus.32.6?lang=he)** is the answer.\n",
    "\n",
    "If I were to add the rest of Tanakh (just a matter of downloading data from tanach.us), it might make sense to filter for *shem kodesh* automatically."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
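
The closing note in the notebook mentions filtering for *shem kodesh* automatically. A minimal sketch of what that could look like is below. It assumes input shaped like the notebook's consonant-only `words` lists. The names `SHEM_KODESH`, `PREFIXES`, and `contains_shem_kodesh` are hypothetical and not part of the notebook, and the list of names is a short illustrative assumption rather than a vetted one.

```python
# Hypothetical and deliberately short list of divine names, consonants only.
# A real filter would need a vetted list; this one is an assumption for illustration.
SHEM_KODESH = {'יהוה', 'אלהים', 'אדני'}

# One-letter prefixes that can attach to a name (and, in, to, from, like, the, that).
PREFIXES = 'ובלמכהש'


def contains_shem_kodesh(words, max_prefix=2):
    """Return True if any consonant-only word is a listed name,
    optionally preceded by up to max_prefix prefix letters."""
    for word in words:
        for n in range(max_prefix + 1):
            prefix, stem = word[:n], word[n:]
            if stem in SHEM_KODESH and all(c in PREFIXES for c in prefix):
                return True
    return False


# Consonant-only words shaped like the notebook's data[...]['words'] lists.
print(contains_shem_kodesh(['אני', 'יהוה']))   # True
print(contains_shem_kodesh(['וישב', 'העם']))   # False
```

Matching on consonants alone is crude: a word such as אלהים can also refer to other gods, and shorter names collide with ordinary words, so anything this check flags (or passes) would still deserve a manual look.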