Created January 3, 2018 00:32
Jupyter notebook for NLP Building Blocks
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we show how to use the NLP Building Blocks to perform named-entity extraction from natural language text. You can launch the NLP Building Blocks in Docker containers using the docker-compose script at https://github.com/mtnfog/nlp-building-blocks.\n",
    "\n",
    "First, we import the Python json and requests libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import requests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we define a function to perform sentence extraction. This function makes an API call to the Prose Sentence Extraction Engine. The API takes natural language text and returns a JSON array containing the individual sentences in the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_sentences(text):\n",
    "    headers = {'Content-Type': 'text/plain'}\n",
    "    api_url = 'http://192.168.1.134:8060/api/sentences'\n",
    "    response = requests.post(api_url, headers=headers, data=text)\n",
    "    return json.loads(response.content.decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to above, we now define a function to perform tokenization using the Sonnet Tokenization Engine. The API call takes an individual sentence (extracted by the function above) and returns a JSON array containing the individual tokens in the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize(sentence):\n",
    "    headers = {'Content-Type': 'text/plain'}\n",
    "    api_url = 'http://192.168.1.134:9040/api/tokenize'\n",
    "    response = requests.post(api_url, headers=headers, data=sentence)\n",
    "    return json.loads(response.content.decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, we define a function to extract named-entities from the text. This API call uses the Idyl E3 Entity Extraction Engine. The call posts the tokens (produced by the function above) and returns any named-entities found. Our Idyl E3 instance is running a model trained for English-language person entities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_entities(tokens):\n",
    "    headers = {'Content-Type': 'application/json'}\n",
    "    api_url = 'http://192.168.1.134:9000/api/extract'\n",
    "    response = requests.post(api_url, headers=headers, json=tokens)\n",
    "    return json.loads(response.content.decode('utf-8'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to execute the API calls. We first extract the sentences from the text, then tokenize each sentence, and lastly look for named-entities in the tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{u'entities': [], u'extractionTime': 1}\n",
      "{u'entities': [{u'languageCode': u'eng', u'confidence': 0.96, u'span': {u'tokenEnd': 2, u'tokenStart': 0}, u'extractionDate': 1514939299940, u'text': u'George Washington', u'type': u'person', u'metadata': {u'x-model-filename': u'mtnfog-en-person.bin'}}], u'extractionTime': 1}\n"
     ]
    }
   ],
   "source": [
    "# Extract the sentences in the input text.\n",
    "sentences = extract_sentences('This is a sentence. George Washington was president.')\n",
    "\n",
    "for s in sentences:\n",
    "    # Tokenize each sentence.\n",
    "    tokens = tokenize(s)\n",
    "    # Extract entities from the tokens.\n",
    "    print(extract_entities(tokens))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the output we see the results for the two sentences. The first sentence did not contain any named-entities. The second sentence contained one named-entity, \"George Washington\", identified as a person. The response also includes the entity's token span in the text, the confidence that this is an entity, and the file name of the model that identified it."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
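The entity responses printed by the notebook can be post-processed without any of the services running. Below is a minimal sketch: the response structure is copied from the sample output recorded in the notebook, while `filter_entities` is a hypothetical helper of our own, not part of the Idyl E3 API.

```python
# Sample Idyl E3-style response, copied from the notebook's recorded output.
sample_response = {
    'entities': [
        {
            'languageCode': 'eng',
            'confidence': 0.96,
            'span': {'tokenStart': 0, 'tokenEnd': 2},
            'extractionDate': 1514939299940,
            'text': 'George Washington',
            'type': 'person',
            'metadata': {'x-model-filename': 'mtnfog-en-person.bin'},
        }
    ],
    'extractionTime': 1,
}

def filter_entities(response, entity_type='person', min_confidence=0.5):
    """Return (text, confidence) pairs for entities of the given type
    whose confidence meets the threshold."""
    return [
        (e['text'], e['confidence'])
        for e in response.get('entities', [])
        if e['type'] == entity_type and e['confidence'] >= min_confidence
    ]

# The loop in the notebook could pass each extract_entities() result here
# instead of printing the raw dictionary.
print(filter_entities(sample_response))  # → [('George Washington', 0.96)]
```

Raising `min_confidence` is a simple way to trade recall for precision when the model produces low-confidence spans.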