{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentence segmentation benchmark\n",
"\n",
"A comparison of different sentence segmentation models for the English language. The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is used to benchmark the models:\n",
"* Simple punctuation split (\".\", \"!\", \"?\")\n",
"* [Unicode sentence breaks specification](https://unicode-rs.github.io/unicode-segmentation/unicode_segmentation/trait.UnicodeSegmentation.html#tymethod.split_sentence_bounds) (Rust module)\n",
"* [NLTK Punkt model](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer) (pre-trained model)\n",
"* [spaCy model which uses dependency parsing](https://spacy.io/usage/linguistic-features#sbd-parser)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Useful refs about sentence segmentation:\n",
"* [Dissertation: An Adaptive Sentence Segmentation System](https://arxiv.org/pdf/cmp-lg/9503019.pdf) \n",
"* [Book: Tokenisation and Sentence Segmentation](https://pdfs.semanticscholar.org/eeb9/3adb89f0621fd13c8701b40eaeae74e0c804.pdf)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"import numpy as np\n",
"from nltk.corpus import brown\n",
"from sacremoses import MosesTokenizer, MosesDetokenizer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.download('brown', quiet=True)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"def sent_indices_from_list(sents, space_included=False):\n",
" \"\"\"\n",
" Convert a list of sentences into a list of indiceis indicating the setence spans.\n",
" For example:\n",
" Non-tokenised text: \"A. BB. C.\"\n",
" sents: [\"A.\", \"BB.\", \"C.\"] -> [3, 7]\n",
" \"\"\"\n",
" indices = []\n",
" offset = 0\n",
" for sentence in sents[:-1]:\n",
" offset += len(sentence)\n",
" if not space_included:\n",
" offset += 1\n",
" indices += [offset]\n",
" return indices\n",
"\n",
"def evaluate_indices(true, pred):\n",
" \"\"\"\n",
" Calculate the Precision, Recall and F1-score. Input is a list of indices of the sentence spans.\n",
" \"\"\"\n",
" true, pred = set(true), set(pred)\n",
" TP = len(pred.intersection(true))\n",
" FP = len(pred - true)\n",
" FN = len(true - pred)\n",
" \n",
" return TP, FP, FN\n",
"\n",
"def score(TP, FP, FN):\n",
" precision = TP / (TP + FP)\n",
" recall = TP / (TP + FN)\n",
" f1 = 2*(precision*recall)/(precision+recall) if precision+recall!=0 else 0\n",
" return precision, recall, f1\n",
" \n",
"\n",
"def evaluate_recall_plus_minus_1(true, pred):\n",
" \"\"\"\n",
" Aprox. calculate the Recall given the predicted can be +/-1 from the true value.\n",
" \"\"\"\n",
" pred_pm_1 = set(e for idx in pred for e in [idx-1, idx, idx+1])\n",
" \n",
" true, pred = set(true), set(pred)\n",
" TP = len(pred_pm_1.intersection(true))\n",
" FN = len(true - pred_pm_1)\n",
" recall = TP / (TP + FN)\n",
" return recall"
]
},
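{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the helpers can be exercised on the toy example from the docstring above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example from the docstring: \"A. BB. C.\" has true sentence-start indices [3, 7]\n",
"toy_true = sent_indices_from_list([\"A.\", \"BB.\", \"C.\"])   # -> [3, 7]\n",
"toy_pred = sent_indices_from_list([\"A.\", \"BB. C.\"])      # -> [3], one missed boundary\n",
"score(*evaluate_indices(toy_true, toy_pred))              # precision 1.0, recall 0.5"
]
},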
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prepare Brown corpus\n",
"\n",
"The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is made up of 500 documents, which each contain many sentences. The original text (natural language) is not present in the data and needs to be recompiled. The conversion process taken had be inspired from a [StackOverflow post](https://stackoverflow.com/a/47301618/1110328)"
]
},
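{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why detokenisation is needed, it helps to peek at the tokenised form the corpus actually stores (punctuation and quote marks are kept as separate tokens):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the stored (pre-tokenised) form of the first sentence of the first document\n",
"print(brown.sents(brown.fileids()[0])[0])"
]
},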
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"500"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc_names = brown.fileids()\n",
"len(doc_names)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Reconstruct natural text\n",
"detokenizer = MosesDetokenizer()\n",
"\n",
"# Gives a list of documents which contains a list of setneces. List[List[str]]\n",
"brown_natural_docs_sents = [\n",
" [\n",
" detokenizer.detokenize(\n",
" ' '.join(sent)\\\n",
" .replace('``', '\"')\\\n",
" .replace(\"''\", '\"')\\\n",
" .replace('`', \"'\")\\\n",
" .split()\n",
" , return_str=True)\n",
" for sent in brown.sents(doc)\n",
" ]\n",
" for doc in doc_names\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Reproduce documents by joining senteces together. List[str]\n",
"brown_natural_docs = [' '.join(doc) for doc in brown_natural_docs_sents]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Determine sentence indicies. List[List[int]]\n",
"brown_sent_indicies = [sent_indices_from_list([sent for sent in doc]) for doc in brown_natural_docs_sents]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Store total length of each document\n",
"total_len = [len(' '.join(sent for sent in doc)) for doc in brown_natural_docs_sents]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Punctuation spliter"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def split_on_punct(doc: str):\n",
" \"\"\" Split document by sentences using punctuation \".\", \"!\", \"?\". \"\"\"\n",
" punct_set = {'.', '!', '?'}\n",
" \n",
" start = 0\n",
" seen_period = False\n",
" \n",
" for i, token in enumerate(doc): \n",
" is_punct = token in punct_set\n",
" if seen_period and not is_punct:\n",
" if re.match('\\s', token):\n",
" yield doc[start : i+1]\n",
" start = i+1\n",
" else:\n",
" yield doc[start : i]\n",
" start = i\n",
" seen_period = False\n",
" elif is_punct:\n",
" seen_period = True\n",
" if start < len(doc):\n",
" yield doc[start : len(doc)]"
]
},
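{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustration of the naive splitter's main failure mode, abbreviations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The splitter breaks after any \".\", so abbreviations cause false splits:\n",
"# ['Dr. ', 'Smith arrived. ', 'He was late!']\n",
"list(split_on_punct(\"Dr. Smith arrived. He was late!\"))"
]
},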
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Split documents\n",
"punct_sents_str = [list(split_on_punct(doc)) for doc in brown_natural_docs]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Obtain indicies\n",
"punct_sent_indicies = [sent_indices_from_list(doc, space_included=True) for doc in punct_sents_str]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Evaluate metrics per document\n",
"punct_sent_metrics = np.array([evaluate_indices(brown_sent_indicies[i], punct_sent_indicies[i])\n",
" for i in range(len(brown_sent_indicies))])"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.896\n",
"Recall: 0.915\n",
"F1-score: 0.906\n"
]
}
],
"source": [
"# Sum counts accross all docs and calcaulte metrics\n",
"punct_sent_metrics_avg = score(*punct_sent_metrics.sum(axis=0))\n",
"print(\"Precision: %.3f\" % punct_sent_metrics_avg[0])\n",
"print(\"Recall: %.3f\" % punct_sent_metrics_avg[1])\n",
"print(\"F1-score: %.3f\" % punct_sent_metrics_avg[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Unicode splitter"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Light pyO3 wrapper around UnicodeSegmentation::split_sentence_bounds function (unicode-segmentation = \"1.6.0\")\n",
"from unicode_seg import split_sentence_bounds "
]
},
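{
"cell_type": "markdown",
"metadata": {},
"source": [
"`unicode_seg` is a small custom wrapper and is not on PyPI. As a rough sketch, the same UAX #29 sentence boundaries could be approximated with PyICU (assuming it is installed); the benchmark numbers below come from the Rust wrapper, not from this sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of an alternative to the Rust wrapper, assuming the PyICU package is available\n",
"from icu import BreakIterator, Locale\n",
"\n",
"def split_sentence_bounds_icu(text):\n",
"    \"\"\"Split text on ICU (UAX #29) sentence boundaries.\"\"\"\n",
"    bi = BreakIterator.createSentenceInstance(Locale('en'))\n",
"    bi.setText(text)\n",
"    start, parts = bi.first(), []\n",
"    for end in bi:  # iterating yields successive boundary offsets\n",
"        parts.append(text[start:end])\n",
"        start = end\n",
"    return parts"
]
},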
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Split documents\n",
"unicode_sents_str = [split_sentence_bounds(doc) for doc in brown_natural_docs]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Determine indices\n",
"unicode_sents_indicies = [sent_indices_from_list(doc, space_included=True) for doc in unicode_sents_str]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"unicode_sents_metrics = np.array([evaluate_indices(brown_sent_indicies[i], unicode_sents_indicies[i])\n",
" for i in range(len(brown_sent_indicies))])"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.938\n",
"Recall: 0.912\n",
"F1-score: 0.925\n"
]
}
],
"source": [
"# Sum counts accross all docs and calcaulte metrics\n",
"unicode_sents_metrics_avg = score(*unicode_sents_metrics.sum(axis=0))\n",
"print(\"Precision: %.3f\" % unicode_sents_metrics_avg[0])\n",
"print(\"Recall: %.3f\" % unicode_sents_metrics_avg[1])\n",
"print(\"F1-score: %.3f\" % unicode_sents_metrics_avg[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NLTK Punkt"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"from nltk.tokenize.punkt import PunktSentenceTokenizer\n",
"import nltk"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'3.4.5'"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nltk.__version__"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Split documents\n",
"punkt = PunktSentenceTokenizer()\n",
"punkt_sents_str = [punkt.tokenize(doc) for doc in brown_natural_docs]"
]
},
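{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, Punkt can also report character spans directly via `span_tokenize`, which should give the same boundary indices without re-joining sentence strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sentence spans straight from Punkt; the start offset of every sentence after\n",
"# the first should match the indices reconstructed in the next cell\n",
"spans = list(punkt.span_tokenize(brown_natural_docs[0]))\n",
"[start for start, end in spans[1:]][:5]"
]
},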
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Determin indices of sentences\n",
"punkt_sent_indicies = [sent_indices_from_list(doc, space_included=False) for doc in punkt_sents_str]"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"# Evaluate counts per document\n",
"punkt_sent_metrics = np.array([evaluate_indices(brown_sent_indicies[i], punkt_sent_indicies[i])\n",
" for i in range(len(brown_sent_indicies))])"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.907\n",
"Recall: 0.875\n",
"F1-score: 0.891\n"
]
}
],
"source": [
"# Sum counts accross all docs and calcaulte metrics\n",
"punkt_sent_metrics_avg = score(*punkt_sent_metrics.sum(axis=0))\n",
"print(\"Precision: %.3f\" % punkt_sent_metrics_avg[0])\n",
"print(\"Recall: %.3f\" % punkt_sent_metrics_avg[1])\n",
"print(\"F1-score: %.3f\" % punkt_sent_metrics_avg[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## spaCy"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"import spacy"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load(\"en_core_web_sm\")"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2.2.5'"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nlp.meta['version']"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# Process each document using spaCy\n",
"spacy_docs = list(nlp.pipe(brown_natural_docs, n_threads=3, disable=['ner']))"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Convert list to strings for each sentence\n",
"spacy_sents_str = [[sents.text_with_ws for sents in doc.sents] for doc in spacy_docs]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Determine sentence indicies\n",
"spacy_sent_indicies = [sent_indices_from_list(doc, space_included=True) for doc in spacy_sents_str]"
]
},
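{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, the boundary offsets can be read directly from the parse via each sentence's first token (`Token.idx`), which should match the indices computed above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Character offsets of each sentence start (skipping the first sentence) in the first document\n",
"[sent[0].idx for sent in spacy_docs[0].sents][1:6]"
]
},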
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"# Calcualte counts for each document\n",
"spacy_metrics = np.array([evaluate_indices(brown_sent_indicies[i], spacy_sent_indicies[i])\n",
" for i in range(len(brown_sent_indicies))])"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.924\n",
"Recall: 0.908\n",
"F1-score: 0.916\n"
]
}
],
"source": [
"# Sum counts accross all docs and calcaulte metrics\n",
"spacy_metrics_avg = score(*spacy_metrics.sum(axis=0))\n",
"print(\"Precision: %.3f\" % spacy_metrics_avg[0])\n",
"print(\"Recall: %.3f\" % spacy_metrics_avg[1])\n",
"print(\"F1-score: %.3f\" % spacy_metrics_avg[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### spaCy edge case\n",
"\n",
"It has been noticed that when sentences are enclosed with quote marks spaCy and the Brown corpus split the sentences in slightly different places. For example:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Fears prejudicial aspects',\n",
" '\"The statements may be highly prejudicial to my client\", Bellows told the court.']"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"brown_natural_docs_sents[2][4:6]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Fears prejudicial aspects \"',\n",
" 'The statements may be highly prejudicial to my client\", Bellows told the court. ']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spacy_sents_str[2][4:6]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compensate for this we can calculate the recall given the predicted can be +/-1 from the true value."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"spacy_metrics_pm_1 = [evaluate_recall_plus_minus_1(brown_sent_indicies[i], spacy_sent_indicies[i])\n",
" for i in range(len(brown_sent_indicies))]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Recall ±1: 0.916\n"
]
}
],
"source": [
"spacy_metrics_pm_1_avg = np.array(spacy_metrics_pm_1).mean(axis=0)\n",
"print(\"Recall ±1: %.3f\" % spacy_metrics_pm_1_avg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This doesn't change the score much."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Apendix: Verify evaluation\n",
"\n",
"Check that evaluation function results is the same as scikit-learn output"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import precision_score, recall_score, f1_score"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"true_array = np.zeros(total_len[0])\n",
"true_array[np.array(brown_sent_indicies[0])-1] = 1\n",
"pred_array = np.zeros(total_len[0])\n",
"pred_array[np.array(spacy_sent_indicies[0])-1] = 1"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.9782608695652174, 0.9278350515463918, 0.9523809523809524)"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"precision_score(true_array, pred_array), recall_score(true_array, pred_array), f1_score(true_array, pred_array)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(0.9782608695652174, 0.9278350515463918, 0.9523809523809524)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"score(*evaluate_indices(brown_sent_indicies[0], spacy_sent_indicies[0]))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}