{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we show how to use the NLP Building Blocks to perform named-entity extraction from natural language text. You can launch the NLP Building Blocks in Docker containers using the docker-compose script at https://github.com/mtnfog/nlp-building-blocks.\n",
"\n",
"First, we'll include the Python requests library."
]
},
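{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal launch sketch: clone the repository and start the containers in\n",
"# the background. Assumes Docker Compose is installed and docker-compose.yml\n",
"# sits at the repository root.\n",
"!git clone https://github.com/mtnfog/nlp-building-blocks.git\n",
"!cd nlp-building-blocks && docker-compose up -d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we'll import the Python requests library."
]
},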
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to define a function to perform sentence extraction. This function makes an API call to Prose Sentence Extraction Engine. The API takes in natural language text and returns a JSON array containing the individual sentences in the text."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"def extract_sentences(text):\n",
" headers = {'Content-Type': 'text/plain'}\n",
" api_url = 'http://192.168.1.134:8060/api/sentences'\n",
" response = requests.post(api_url, headers=headers, data=text)\n",
" return json.loads(response.content.decode('utf-8'))"
]
},
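{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check, we can call the function on a short string; per the API description above, the result should be a JSON array of sentence strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Expected (per the API description): one string per sentence.\n",
"extract_sentences('This is a sentence. George Washington was president.')"
]
},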
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to above, we are now going to define a function to perform tokenization using Sonnet Tokenization Engine. The API call takes an individual sentence (extracted by the function above) and returns a JSON array containing the individual tokens in the sentence."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"def tokenize(sentence):\n",
" headers = {'Content-Type': 'text/plain'}\n",
" api_url = 'http://192.168.1.134:9040/api/tokenize'\n",
" response = requests.post(api_url, headers=headers, data=sentence)\n",
" return json.loads(response.content.decode('utf-8'))"
]
},
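{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again as a quick check, we can tokenize a single sentence; per the description above, the result should be a JSON array of the sentence's tokens."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Expected (per the API description): one string per token.\n",
"tokenize('George Washington was president.')"
]
},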
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we make a function to extract named-entities from the text. This API call uses Idyl E3 Entity Extraction Engine. The API call posts the tokens (extracted by the function above) and returns any found named-entities. Our Idyl E3 is running a trained model for English-language person entities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def extract_entities(tokens):\n",
" headers = {'Content-Type': 'application/json'}\n",
" api_url = 'http://192.168.1.134:9000/api/extract'\n",
" response = requests.post(api_url, headers=headers, json=tokens)\n",
" return json.loads(response.content.decode('utf-8')) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are ready to execute the API calls. We first extract the sentences from text, then we tokenize each sentence, and lastly, we look for named-entities in the tokens."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{u'entities': [], u'extractionTime': 1}\n",
"{u'entities': [{u'languageCode': u'eng', u'confidence': 0.96, u'span': {u'tokenEnd': 2, u'tokenStart': 0}, u'extractionDate': 1514939299940, u'text': u'George Washington', u'type': u'person', u'metadata': {u'x-model-filename': u'mtnfog-en-person.bin'}}], u'extractionTime': 1}\n"
]
}
],
"source": [
"# Extract the sentences in the input text.\n",
"sentences = extract_sentences('This is a sentence. George Washington was president.')\n",
"\n",
"for s in sentences:\n",
" # Tokenize each sentence.\n",
" tokens = tokenize(s)\n",
" # Extract entities from the tokens.\n",
" print(extract_entities(tokens))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the output we see the results for the two sentences. The first sentence did not include any named-entities. The second sentence included one named-entity \"George Washington\" identified as a person. The response also includes the entity's location in the text, the confidence that this is an entity, as well as the file name of the model that identified this entity."
]
}
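{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Collect (text, type, confidence) for each extracted entity.\n",
"# The keys ('entities', 'text', 'type', 'confidence') follow the response\n",
"# printed above.\n",
"found = []\n",
"for s in sentences:\n",
"    response = extract_entities(tokenize(s))\n",
"    for entity in response['entities']:\n",
"        found.append((entity['text'], entity['type'], entity['confidence']))\n",
"\n",
"print(found)"
]
}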
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}