@FavioVazquez
Last active March 31, 2024 04:39
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: textstat in /Users/faviovazquez/anaconda3/lib/python3.7/site-packages (0.6.0)\n",
"Requirement already satisfied: pyphen in /Users/faviovazquez/anaconda3/lib/python3.7/site-packages (from textstat) (0.9.5)\n"
]
}
],
"source": [
"!pip install textstat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# English"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import textstat"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"textstat.set_lang(\"en\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"\n",
"Data science is the main focus of most sciences and studies right now, \n",
"it needs a lot of things like AI, programming, statistics, \n",
"business understanding, effective presentation skills and much more. \n",
"That's why it's not easy to understand or study. But we can do it, we are doing it.\n",
"Data science has become the standard solving problem framework for academia and \n",
"the industry and it's going to be like that for a while. But we need to remember \n",
"where we are coming from, who we are and where we are going.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"126"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count Syllables\n",
"textstat.syllable_count(text)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"91"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Lexicon count (number of words); removepunct=True strips punctuation before counting\n",
"textstat.lexicon_count(text, removepunct=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Sentence count\n",
"textstat.sentence_count(text)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"65.25"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Flesch Reading Ease formula\n",
"textstat.flesch_reading_ease(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A score of around 65 falls in the \"standard\" difficulty band (60 to 70) of the Flesch Reading Ease scale."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9.8"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Flesch-Kincaid Grade Level\n",
"textstat.flesch_kincaid_grade(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Flesch-Kincaid grade of 9.8 means the text is readable by a student at roughly a 9th- to 10th-grade level, which is in line with the \"standard\" Flesch Reading Ease score."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.76"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fog Scale (Gunning FOG Formula)\n",
"textstat.gunning_fog(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means a high school junior can read it."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.2"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
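],
"source": [
"# SMOG Index (similar to Fog). SMOG was normed on 30-sentence samples, so treat the value for this 4-sentence text as a rough estimate\n",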
"textstat.smog_index(text)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.8"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Automated Readability Index\n",
"textstat.automated_readability_index(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means an eleventh-grade student can read it."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"8.94"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Coleman-Liau Index\n",
"textstat.coleman_liau_index(text)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"10.7"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Linsear Write Formula\n",
"textstat.linsear_write_formula(text)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7.54"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dale-Chall Readability Score\n",
"textstat.dale_chall_readability_score(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Meaning that an average 9th or 10th-grade student can read it."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'11th and 12th grade'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Readability Consensus\n",
"textstat.text_standard(text, float_output=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means that, in general, someone who has finished 11th or 12th grade could understand this piece."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6.08"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Time to read the text in seconds\n",
"textstat.reading_time(text)"
]
},
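{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Run all at once: build the qualified names of textstat's methods\n",
"import inspect\n",
"# getmembers returns (name, method) pairs sorted by name; call it once and reuse the result\n",
"members = inspect.getmembers(textstat, predicate=inspect.ismethod)\n",
"# The 1:28 slice is specific to the installed textstat version (0.6.0 here)\n",
"funcs = [\"textstat.\" + name for name, _ in members[1:28]]"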
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"avg_character_per_word\n",
"4.64\n",
" \n",
"avg_letter_per_word\n",
"4.47\n",
" \n",
"avg_sentence_length\n",
"22.8\n",
" \n",
"avg_sentence_per_word\n",
"0.04\n",
" \n",
"avg_syllables_per_word\n",
"1.4\n",
" \n",
"char_count\n",
"422\n",
" \n",
"coleman_liau_index\n",
"8.94\n",
" \n",
"dale_chall_readability_score\n",
"7.54\n",
" \n",
"dale_chall_readability_score_v2\n",
"7.54\n",
" \n",
"difficult_words\n",
"16\n",
" \n",
"difficult_words_list\n",
"['data', 'programming', 'presentation', 'problem', 'industry', 'focus', 'framework', 'statistics', 'understanding', 'standard', 'doing', 'science', 'studies', 'solving', 'sciences', 'effective']\n",
" \n",
"flesch_kincaid_grade\n",
"9.8\n",
" \n",
"flesch_reading_ease\n",
"65.25\n",
" \n",
"gunning_fog\n",
"11.76\n",
" \n",
"letter_count\n",
"407\n",
" \n",
"lexicon_count\n",
"91\n",
" \n",
"linsear_write_formula\n",
"10.7\n",
" \n",
"lix\n",
"42.58\n",
" \n",
"polysyllabcount\n",
"8\n",
" \n",
"reading_time\n",
"6.08\n",
" \n",
"rix\n",
"4.5\n",
" \n",
"sentence_count\n",
"4\n",
" \n",
"set_lang\n",
"None\n",
" \n",
"smog_index\n",
"11.2\n",
" \n",
"spache_readability\n",
"5.5588379120879114\n",
" \n",
"syllable_count\n",
"126\n",
" \n",
"text_standard\n",
"11th and 12th grade\n",
" \n"
]
}
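],
"source": [
"for elem in funcs:\n",
"    name = elem.split(\".\")[1]\n",
"    method = getattr(textstat, name)  # look the method up by name instead of using eval\n",
"    # Reset the language each time: set_lang is itself in the list, and calling it with the text overwrites the language\n",
"    textstat.set_lang(\"en\")\n",
"    print(name)\n",
"    print(method(text))\n",
"    print(\" \")"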
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spanish - Español"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"\n",
"La ciencia de datos es el foco principal de la mayoría de las ciencias y estudios en este momento, \n",
"necesita muchas cosas como inteligencia artificial, programación, estadísticas, \n",
"comprensión del negocio, habilidades de presentación efectivas y mucho más. \n",
"Por eso no es fácil de entender o estudiar. Pero podemos hacerlo, lo estamos haciendo.\n",
"La ciencia de datos se ha convertido en el marco de resolución de \n",
"problemas estándar para la academia y la industria y va a ser así \n",
"por un tiempo. Pero debemos recordar de dónde venimos, \n",
"quiénes somos y hacia dónde vamos.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"textstat.set_lang(\"es\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Note: For Spanish, the only readability function implemented is the Fernández Huerta readability formula, which is a variant of the Flesch Reading Ease formula"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"61.75"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# With the language set to \"es\", this applies the Fernández Huerta variant noted above\n",
"textstat.flesch_reading_ease(text)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6.92"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Time to read the text in seconds\n",
"textstat.reading_time(text)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['venimos',\n",
" 'resolución',\n",
" 'muchas',\n",
" 'estadísticas',\n",
" 'hacerlo',\n",
" 'dónde',\n",
" 'mucho',\n",
" 'pero',\n",
" 'estudios',\n",
" 'presentación',\n",
" 'ciencia',\n",
" 'datos',\n",
" 'comprensión',\n",
" 'mayoría',\n",
" 'negocio',\n",
" 'como',\n",
" 'vamos',\n",
" 'quiénes',\n",
" 'momento',\n",
" 'inteligencia',\n",
" 'programación',\n",
" 'industria',\n",
" 'habilidades',\n",
" 'convertido',\n",
" 'ciencias',\n",
" 'efectivas',\n",
" 'estamos',\n",
" 'marco',\n",
" 'estándar',\n",
" 'recordar',\n",
" 'cosas',\n",
" 'estudiar',\n",
" 'principal',\n",
" 'artificial',\n",
" 'fácil',\n",
" 'necesita',\n",
" 'hacia',\n",
" 'entender',\n",
" 'debemos',\n",
" 'academia',\n",
" 'tiempo',\n",
" 'para',\n",
" 'somos',\n",
" 'problemas',\n",
" 'haciendo',\n",
" 'foco',\n",
" 'podemos',\n",
" 'este']"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This works only so-so in Spanish: everyday words such as 'pero', 'como' and 'para' are flagged as difficult\n",
"textstat.difficult_words_list(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Check spelling"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: autocorrect in /Users/faviovazquez/anaconda3/lib/python3.7/site-packages (0.4.4)\n"
]
}
],
"source": [
"!pip install autocorrect"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# Here I'm deliberately misspelling:\n",
"# presentation as presentatio\n",
"# focus as focsu\n",
"# framework as framwork\n",
"text = \"\"\"\n",
"Data science is the main focsu of most sciences and studies right now, \n",
"it needs a lot of things like AI, programming, statistics, \n",
"business understanding, effective presentatio skills and much more. \n",
"That's why it's not easy to understand or study. But we can do it, we are doing it.\n",
"Data science has become the standard solving problem framwork for academia and \n",
"the industry and it's going to be like that for a while. But we need to remember \n",
"where we are coming from, who we are and where we are going.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"\\ndata science is the main focus of most sciences and studies right now, \\nit needs a lot of things like AI, programming, statistics, \\nbusiness understanding, effective presentation skills and much more. \\nThat's why it's not easy to understand or study. But we can do it, we are doing it.\\ndata science has become the standard solving problem framework for academia and \\nthe industry and it's going to be like that for a while. But we need to remember \\nwhere we are coming from, who we are and where we are going.\\n\""
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
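],
"source": [
"from autocorrect import Speller\n",
"\n",
"# Build an English spell checker\n",
"check = Speller(lang='en')\n",
"\n",
"# Correct the whole text in one call; note that the recorded output also lowercases \"Data\"\n",
"check(text)"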
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}