Created August 14, 2019 04:46
Customizing spaCy's Statistical Sentence Segmenter with Custom Rules
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reference: [Sentence Segmentation](https://spacy.io/usage/linguistic-features#sbd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('parser', <spacy.pipeline.DependencyParser at 0x7f5bb94ad780>)]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import spacy\n",
    "NLP = spacy.load(\"en_core_web_md\", disable=[\"tagger\", \"ner\"])\n",
    "NLP.pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Spacy version: 2.0.18\n",
      "Model version: 2.0.0\n"
     ]
    }
   ],
   "source": [
    "print(\"Spacy version:\", spacy.__version__)\n",
    "print(\"Model version:\", NLP.meta['version'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Default sentence segmentation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "text=\"\"\"\n",
    "Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.\n",
    "\n",
    "Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China’s military garrison descended into an evening of clashes, panic and widespread disruption.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Hong Kong Shows the Flaws in China',\n",
       " '’s',\n",
       " 'Zero-Sum Worldview.',\n",
       " 'Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China',\n",
       " '’s military garrison descended into an evening of clashes, panic and widespread disruption.']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = NLP(text)\n",
    "[sent.text.strip() for sent in doc.sents]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom rule-based strategy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('set_custom_boundaries', <function __main__.set_custom_boundaries(doc)>),\n",
       " ('parser', <spacy.pipeline.DependencyParser at 0x7f5bb94ad780>)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def set_custom_boundaries(doc):\n",
    "    # Runs before the parser: a possessive clitic (’s or 's) must never\n",
    "    # open a new sentence, so forbid a sentence start on those tokens.\n",
    "    for token in doc:\n",
    "        if token.text in (\"’s\", \"'s\"):\n",
    "            token.is_sent_start = False\n",
    "    return doc\n",
    "\n",
    "NLP.add_pipe(set_custom_boundaries, before=\"parser\")\n",
    "NLP.pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.',\n",
       " 'Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China’s military garrison descended into an evening of clashes, panic and widespread disruption.']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = NLP(text)\n",
    "[sent.text.strip() for sent in doc.sents]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
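
The custom rule in the notebook does one thing: it forbids a sentence start on possessive clitics before the parser runs. The sketch below imitates that override in plain Python, without spaCy, so the logic can be checked in isolation. It rests on a simplified model: `apply_custom_boundaries`, `split_sentences`, and the hand-written `starts` flags are hypothetical illustrations, not spaCy APIs. In spaCy the parser still decides every boundary the rule leaves unset; this sketch only models the override itself.

```python
from typing import List

# Tokens that must never begin a sentence (curly and straight apostrophes).
POSSESSIVE_CLITICS = ("’s", "'s")


def apply_custom_boundaries(tokens: List[str], starts: List[bool]) -> List[bool]:
    """Return a copy of `starts` with sentence starts cleared on possessive clitics."""
    fixed = list(starts)
    for i, tok in enumerate(tokens):
        if tok in POSSESSIVE_CLITICS:
            fixed[i] = False
    return fixed


def split_sentences(tokens: List[str], starts: List[bool]) -> List[List[str]]:
    """Group tokens into sentences; a True flag opens a new sentence."""
    sentences: List[List[str]] = []
    for tok, is_start in zip(tokens, starts):
        if is_start or not sentences:
            sentences.append([tok])
        else:
            sentences[-1].append(tok)
    return sentences


tokens = ["headquarters", "of", "China", "’s", "military", "garrison"]
# Hypothetical flags imitating the faulty default split at the clitic.
starts = [True, False, False, True, False, False]

before = split_sentences(tokens, starts)
after = split_sentences(tokens, apply_custom_boundaries(tokens, starts))

print(before)
print(after)
```

With the flag on `’s` cleared, the two fragments merge back into one sentence, which mirrors the difference between the notebook's default output and its output after the custom component is added.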