@jamesthomson
Created July 11, 2016 13:16
Basic named entity recognition example: pull out people, places and organisations.
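import nltk
# Requires NLTK's tokenizer, POS tagger and NE chunker data packages (install them with nltk.download()).
# with open('sample.txt', 'r') as f:
#     sample = f.read()
# article taken from the BBC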
sample="""Renewed fighting has broken out in South Sudan between forces loyal to the president and vice-president. A reporter in the capital, Juba, told the BBC gunfire and large explosions could be heard all over the city; he said heavy artillery was being used. More than 200 people are reported to have died in clashes since Friday. The latest violence came hours after the UN Security Council called on the warring factions to immediately stop the fighting. In a unanimous statement, the council condemned the violence "in the strongest terms" and expressed "particular shock and outrage" at attacks on UN sites. It also called for additional peacekeepers to be sent to South Sudan.
Chinese media say two Chinese UN peacekeepers have now died in Juba. Several other peacekeepers have been injured, as well as a number of civilians who have been caught in crossfire. The latest round of violence erupted when troops loyal to President Salva Kiir and first Vice-President Riek Machar began shooting at each other in the streets of Juba. Relations between the two men have been fractious since South Sudan won independence from Sudan in 2011.
Their forces have fought a civil war. But despite a peace deal last year ending the conflict, both sides retain their military capabilities and have continued to accuse each other of bad faith. On Monday, there were reports of tanks on the streets of Juba and clashes close to the airport and UN camps sheltering civilians. The US embassy warned of "serious fighting" taking place. A BBC correspondent in the Kenyan capital, Nairobi, said it was not clear if Mr Kiir and Mr Machar remained in control of their forces. A UN spokeswoman in Juba, Shantal Persaud, said fighting over the past few days had caused hundreds of internally displaced people to take refuge in UN premises. She said both South Sudanese leaders were responsible for implementing last year's peace agreement, which included a permanent ceasefire and the deployment of forces away from Juba. Information Minister Michael Makuei told the BBC that the situation in the city was "under full control" and civilians who had fled should return to their homes. Mr Machar's military spokesman, Col William Gatjiath, accused officials loyal to the president of lying, and said there had been at least 10 hours of clashes on Sunday. "The situation in South Sudan is uncontrollable because Salva Kiir and his followers are not ready to follow the peace agreement," he said. In a statement on Sunday, the US state department said it strongly condemned the latest outbreak of fighting in Juba.
Spokesman John Kirby said Washington had ordered the departure of non-emergency personnel from the US embassy in Juba.
Mr Kiir and Mr Machar had met at the presidential palace on Friday and issued a call for calm.
Calm was apparently restored on Saturday but heavy gunfire broke out again on Sunday near a military barracks occupied by troops loyal to Mr Machar.
A US academic who studies Sudan, Eric Reeves, told the BBC Mr Machar was trying to orchestrate a coup against his rival, with the backing of President Omar Bashir of Sudan.
"This has been planned," he said. "That violence now seems to be part of a co-ordinated coup led by Riek Machar. This changes entirely the complexion of the crisis."
A spokesman for Mr Machar is reported to have rejected this.
"""
#split text into sentences
sentences = nltk.sent_tokenize(sample)
#split text sentences into words
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# assign a part-of-speech tag to each word
# this is the step that takes the most time
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
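# define a function that extracts named entities of a given type
# entity_type is one of PERSON, ORGANIZATION, GPE, or NE (if binary=True)
def extract_entity_names(t, entity_type):
    entity_names = []
    # only subtrees have a label; leaves are plain (word, tag) tuples
    if hasattr(t, 'label') and t.label:
        if t.label() == entity_type:
            # join the words of the matching chunk into one name
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            # recurse into the children to find nested chunks
            for child in t:
                entity_names.extend(extract_entity_names(child, entity_type))
    return entity_names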
# do for places/people/orgs
places = []
# named entity recognition
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=False)
# extract entity
for tree in chunked_sentences:
    places.extend(extract_entity_names(tree, "GPE"))
places = list(set(places))
people = []
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=False)
for tree in chunked_sentences:
    people.extend(extract_entity_names(tree, "PERSON"))
people = list(set(people))
orgs = []
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=False)
for tree in chunked_sentences:
    orgs.extend(extract_entity_names(tree, "ORGANIZATION"))
orgs = list(set(orgs))
print(places)
print(people)
print(orgs)
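
The script above chunks the tagged sentences three times, once per entity type. A possible variation, sketched here and not part of the original gist, walks each tree once and groups names by label. `Node` is a hypothetical stand-in for `nltk.Tree` (a list of children with a `label()` method) so that the sketch runs without NLTK or its data; `collect_entities` and the hand-built `sentence` are likewise illustrative names and data.

```python
from collections import defaultdict


class Node(list):
    """Hypothetical stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super(Node, self).__init__(children)
        self._label = label

    def label(self):
        return self._label


def collect_entities(tree, wanted=("GPE", "PERSON", "ORGANIZATION")):
    """Walk the tree once and return {label: set of entity names}."""
    found = defaultdict(set)

    def walk(node):
        # Leaves are (word, tag) tuples and have no label() method.
        if not hasattr(node, "label"):
            return
        if node.label() in wanted:
            # Join the words of the chunk into a single name.
            found[node.label()].add(" ".join(child[0] for child in node))
        else:
            for child in node:
                walk(child)

    walk(tree)
    return dict(found)


# Hand-built example shaped like one chunked sentence (illustrative data).
sentence = Node("S", [
    Node("PERSON", [("Salva", "NNP"), ("Kiir", "NNP")]),
    ("visited", "VBD"),
    Node("GPE", [("Juba", "NNP")]),
    (".", "."),
])
entities = collect_entities(sentence)
print(entities)
```

With real NLTK output, the same function could be called on each tree from a single `nltk.ne_chunk_sents(tagged_sentences, binary=False)` pass and the per-sentence sets merged, assuming the trees expose `label()` as `extract_entity_names` already expects.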