@ceshine
Created August 14, 2019 04:46
Customizing spaCy's Statistical Sentence Segmenter with Custom Rules
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reference: [Sentence Segmentation](https://spacy.io/usage/linguistic-features#sbd)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('parser', <spacy.pipeline.DependencyParser at 0x7f5bb94ad780>)]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import spacy\n",
"NLP = spacy.load(\"en_core_web_md\", disable=[\"tagger\", \"ner\"])\n",
"NLP.pipeline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Spacy version: 2.0.18\n",
"Model version: 2.0.0\n"
]
}
],
"source": [
"print(\"Spacy version:\", spacy.__version__)\n",
"print(\"Model version:\", NLP.meta['version'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Default sentence segmentation"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"text=\"\"\"\n",
"Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.\n",
"\n",
"Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China’s military garrison descended into an evening of clashes, panic and widespread disruption.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Hong Kong Shows the Flaws in China',\n",
" '’s',\n",
" 'Zero-Sum Worldview.',\n",
" 'Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China',\n",
" '’s military garrison descended into an evening of clashes, panic and widespread disruption.']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc = NLP(text)\n",
"[sent.text.strip() for sent in doc.sents]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Custom rule-based strategy"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('set_custom_boundaries', <function __main__.set_custom_boundaries(doc)>),\n",
" ('parser', <spacy.pipeline.DependencyParser at 0x7f5bb94ad780>)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def set_custom_boundaries(doc):\n",
" for token in doc[:-1]:\n",
" # print(token.text, token.text in (\"’s\", \"'s\"))\n",
" if token.text in (\"’s\", \"'s\"):\n",
" # print(\"Detected:\", token.text)\n",
" doc[token.i].is_sent_start = False\n",
" return doc\n",
"\n",
"NLP.add_pipe(set_custom_boundaries, before=\"parser\")\n",
"NLP.pipeline"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.',\n",
" 'Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China’s military garrison descended into an evening of clashes, panic and widespread disruption.']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"doc = NLP(text)\n",
"[sent.text.strip() for sent in doc.sents]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
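
The notebook's fix works because a component placed before the parser can preset `is_sent_start` on a token, and the parser then has to respect that flag. The sketch below is a toy, standard-library model of that idea only; it is not spaCy's implementation. `Token`, `naive_segmenter`, and the `guesses` argument are hypothetical names introduced for illustration, and the "parser" here is just a fixed set of proposed sentence-start indices.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    text: str
    # None = undecided, True = starts a sentence, False = must not start one.
    is_sent_start: Optional[bool] = None


def set_custom_boundaries(tokens: List[Token]) -> List[Token]:
    """Same rule as the notebook: a possessive clitic never starts a sentence."""
    for token in tokens:
        if token.text in ("’s", "'s"):
            token.is_sent_start = False
    return tokens


def naive_segmenter(tokens: List[Token], guesses: Set[int]) -> List[str]:
    """Hypothetical stand-in for the statistical parser.

    `guesses` holds the token indices the "model" would like to start a
    sentence at. A preset flag on a token always overrides the guess.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for i, token in enumerate(tokens):
        proposed = i in guesses
        starts = proposed if token.is_sent_start is None else token.is_sent_start
        if starts and current:
            sentences.append(current)
            current = []
        current.append(token.text)
    if current:
        sentences.append(current)
    return [" ".join(words) for words in sentences]


def make_tokens() -> List[Token]:
    words = ["headquarters", "of", "China", "’s", "military", "garrison"]
    return [Token(w) for w in words]


# The toy "model" wrongly proposes a sentence start at the clitic (index 3).
guesses = {0, 3}
print(naive_segmenter(make_tokens(), guesses))
print(naive_segmenter(set_custom_boundaries(make_tokens()), guesses))
```

The first call splits before `’s`, mirroring the default output in the notebook; the second keeps the phrase in one piece because the preset flag wins. One difference from the real pipeline: here the guesses are fixed, whereas the actual parser re-predicts the remaining boundaries under the constraint, which is presumably why the spurious split before "Zero-Sum" also disappears in the notebook's final output.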