@willirath
Last active July 14, 2023 14:17
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "c22818de-13c2-41b1-a4f0-edddcfee82e3",
"metadata": {},
"outputs": [],
"source": [
"# %pip install nltk pandas matplotlib numpy"
]
},
{
"cell_type": "markdown",
"id": "59b43f0e-bf57-4b14-ab8d-abb860248cb8",
"metadata": {},
"source": [
"# Stats about the use of articles in English and Portugese\n",
"\n",
"(With many pinches of salt....)"
]
},
{
"cell_type": "markdown",
"id": "2d07b9eb-cdd3-4bc1-b4dd-ef56a602181a",
"metadata": {},
"source": [
"## Imports, downloads\n",
"\n",
"Note we first import the downloader and only later import the corpora."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "37e4f03a-09bd-4be7-92a3-6bb99aebd822",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cda7a5ff-9c9a-478f-ad32-1ee64d843330",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import nltk"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7e7f7091-1740-417a-bb43-80401f1cb8dc",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package floresta to /home/jovyan/nltk_data...\n",
"[nltk_data] Package floresta is already up-to-date!\n",
"[nltk_data] Downloading package brown to /home/jovyan/nltk_data...\n",
"[nltk_data] Package brown is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.download(\"floresta\")\n",
"nltk.download(\"brown\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "408fd3fa-a93c-4518-a17b-7ce6bf4b8b3f",
"metadata": {},
"outputs": [],
"source": [
"from nltk.corpus import brown, floresta"
]
},
{
"cell_type": "markdown",
"id": "c5e9f115-3056-4b29-99b3-a236af0cbb86",
"metadata": {},
"source": [
"## Analysing the use of articles in English and Portugese\n",
"\n",
"The English _Brown_ corpus tags articles with `\"AT\"`, the Portugese _Floresta_ corpus uses `\"art\"`."
]
},
{
"cell_type": "markdown",
"id": "0d10d66c-2a7e-49b5-b5a1-a2423595fd80",
"metadata": {},
"source": [
"### Let's quantify the frequency of articles."
]
},
{
"cell_type": "markdown",
"id": "6260cd93-e2c6-4590-bdd1-71254a346ceb",
"metadata": {},
"source": [
"The tagged words are returned as lists of tuples with the first tuple element containing the word and the second element containing the tag. As Python starts counting elements at 0, we want to count, how often the element with the number 1 contains either `\"AT\"` for the _Brown_ corpus or `\"art\"` for the _Floresta_ corpus.\n",
"\n",
"We do this using a list comprehension mapping each tagged word to either `True` for articles or `False` for all others and then summing and normalizing over all words."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "36e88599-4993-4372-960f-3a15b698f724",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.08541912104113704\n"
]
}
],
"source": [
"number_words_in_brown = len(brown.words())\n",
"number_articles_in_brown = sum((\"AT\" in p[1] for p in brown.tagged_words()))\n",
"fraction_articles_in_brown = number_articles_in_brown / number_words_in_brown\n",
"print(fraction_articles_in_brown)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "98368377-a708-4969-b1be-10376c14312f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.1385873156732058\n"
]
}
],
"source": [
"number_words_in_floresta = len(floresta.words())\n",
"number_articles_in_floresta = sum((\"art\" in p[1] for p in floresta.tagged_words()))\n",
"fraction_articles_in_floresta = number_articles_in_floresta / number_words_in_floresta\n",
"print(fraction_articles_in_floresta)"
]
},
{
"cell_type": "markdown",
"id": "468021a5-8494-4f72-b40e-1ea39f2a727d",
"metadata": {},
"source": [
"### Using a function\n",
"\n",
"As we've repeated almost exactly the same code twice (and as we might want to do the same for other copora), we could try and find a better way of re-using this logic. This is what functions are for."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "66ed7699-d117-487a-8d6e-d88920a2b673",
"metadata": {},
"outputs": [],
"source": [
"def fraction_of_articles(corpus, article_tag=None):\n",
" number_words = len(corpus.words())\n",
" number_articles = sum((article_tag in p[1] for p in corpus.tagged_words()))\n",
" fraction_articles = number_articles / number_words\n",
" return fraction_articles"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "506d60b5-ceaf-479d-b219-95f01b69d81f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.08541912104113704\n"
]
}
],
"source": [
"fraction_articles_in_brown = fraction_of_articles(brown, article_tag=\"AT\")\n",
"print(fraction_articles_in_brown)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "c7a0fb35-6e50-4b38-bc98-80276d087a25",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.1385873156732058\n"
]
}
],
"source": [
"fraction_articles_in_floresta = fraction_of_articles(floresta, article_tag=\"art\")\n",
"print(fraction_articles_in_floresta)"
]
},
{
"cell_type": "markdown",
"id": "c6cd10b0-ba6d-4535-93a1-7e52c06bc351",
"metadata": {},
"source": [
"### Distance between two uses of articles\n",
"\n",
"Let's do statistics about the typical distance between two subsequent uses of (the same or different) articles."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "79e21e0c-ec6d-4ebc-a7f0-4a3edcc2663e",
"metadata": {},
"outputs": [],
"source": [
"def words_since_last_article(corpus, article_tag=None):\n",
" tagged_words = corpus.tagged_words()\n",
" distance = 0\n",
" for w in tagged_words:\n",
" if article_tag in w[1]:\n",
" yield distance # will create a generator\n",
" distance = 0\n",
" else:\n",
" distance = distance + 1"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f6c4f338-538e-46db-bea4-1554974a18b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"6\n",
"8\n"
]
}
],
"source": [
"dist = words_since_last_article(brown, article_tag=\"AT\")\n",
"print(next(dist))\n",
"print(next(dist))\n",
"print(next(dist))"
]
},
{
"cell_type": "markdown",
"id": "f6d0d1bb-2831-4917-8336-9b3f662766bf",
"metadata": {},
"source": [
"We want to put this into a Pandas datatype which has built in methods for statistics and visualisation:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "19ae948c-795f-4ee5-a9ad-983844a5091c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 99188.000000\n",
"mean 10.706920\n",
"std 10.745149\n",
"min 0.000000\n",
"25% 4.000000\n",
"50% 7.000000\n",
"75% 14.000000\n",
"max 387.000000\n",
"Name: dist_since_last_article, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"distances_brown = pd.Series(\n",
" words_since_last_article(brown, article_tag=\"AT\"),\n",
" name=\"dist_since_last_article\",\n",
")\n",
"\n",
"distances_brown.describe()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "7903e43b-5c54-4a5d-a973-ddc868e875c4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 29360.000000\n",
"mean 6.215463\n",
"std 8.009240\n",
"min 0.000000\n",
"25% 3.000000\n",
"50% 4.000000\n",
"75% 8.000000\n",
"max 1045.000000\n",
"Name: dist_since_last_article, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"distances_floresta = pd.Series(\n",
" words_since_last_article(floresta, article_tag=\"art\"),\n",
" name=\"dist_since_last_article\",\n",
")\n",
"\n",
"distances_floresta.describe()"
]
},
{
"cell_type": "markdown",
"id": "4f867fa4-5466-400b-84c2-69b6f0d986eb",
"metadata": {},
"source": [
"And some visualisation: We'll look at the quantiles of the distances between use of any article."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c191f763-dc9e-473c-a86a-4a0d77813d37",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 700x300 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ax = distances_brown.quantile(np.arange(0, 1, 0.1)).plot(\n",
" label=\"Brown, EN\", legend=True,\n",
" figsize=(7, 3),\n",
")\n",
"distances_floresta.quantile(np.arange(0, 1, 0.1)).plot(\n",
" ax=ax,\n",
" label=\"Floresta, PT\", legend=True,\n",
" ylabel=\"distance btw. articles\",\n",
" xlabel=\"quantile\",\n",
" grid=True,\n",
");"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
matplotlib
nltk
numpy
pandas
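
The counting and distance logic from the notebook can be tried out without downloading any corpus. The following is a minimal sketch under stated assumptions: `toy_tagged_words` is made-up data in the `(word, tag)` shape that `tagged_words()` returns, and the two functions take that sequence directly instead of a corpus object, which differs from the notebook's signatures. Punctuation tokens are counted like any other word here.

```python
import statistics

# Hypothetical toy data in the (word, tag) shape of tagged_words().
# The tags imitate Brown-style tags; they are NOT taken from the corpus.
toy_tagged_words = [
    ("The", "AT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
    ("a", "AT"), ("mat", "NN"), (".", "."),
    ("An", "AT"), ("owl", "NN"), ("slept", "VBD"), (".", "."),
]


def fraction_of_articles(tagged_words, article_tag):
    """Share of tokens whose tag contains article_tag (substring match)."""
    tagged_words = list(tagged_words)
    # True counts as 1 when summed, so this is the number of articles.
    number_articles = sum(article_tag in tag for _, tag in tagged_words)
    return number_articles / len(tagged_words)


def words_since_last_article(tagged_words, article_tag):
    """At each article, yield how many non-article tokens preceded it."""
    distance = 0
    for _, tag in tagged_words:
        if article_tag in tag:
            yield distance
            distance = 0
        else:
            distance += 1


toy_fraction = fraction_of_articles(toy_tagged_words, "AT")
toy_distances = list(words_since_last_article(toy_tagged_words, "AT"))

print(toy_fraction)                      # 3 articles out of 11 tokens
print(toy_distances)                     # [0, 3, 2]
print(statistics.median(toy_distances))  # 2
```

Note that the first value is the number of tokens before the first article, not a distance between two articles, and that tokens after the last article are never reported. Both effects are negligible for a full corpus but visible on small inputs like this one.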