Created
January 28, 2018 16:47
Named Entity Recognition
#set the Java heap size; this must run before rJava is loaded
options(java.parameters = "-Xmx1024m")
#load libraries
library(openxlsx)
library(rJava)
library(NLP)
library(openNLP)
library(RWeka)
#read text
text <- c("My name is Catherine Gitau, I work at Ongair Limited in Nairobi, Kenya")
#collapse the character vector into a single string
text <- paste(text, collapse = " ")
print(text)
#convert text into a String object
text <- as.String(text)
#create annotators for words and sentences
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
#identify where the sentences and the words are
text_annotations <- annotate(text, list(sent_ann, word_ann))
head(text_annotations)
#combine text and the annotations
text_doc <- AnnotatedPlainTextDocument(text, text_annotations)
words(text_doc)
#create annotators of kind person, location and organization
#(the English entity models come from the openNLPmodels.en package, which must be installed)
person_ann <- Maxent_Entity_Annotator(kind = "person")
location_ann <- Maxent_Entity_Annotator(kind = "location")
organization_ann <- Maxent_Entity_Annotator(kind = "organization")
#hold the annotators in the order they are to be applied
pipeline <- list(sent_ann,
                 word_ann,
                 person_ann,
                 location_ann,
                 organization_ann)
text_annotations <- annotate(text, pipeline)
text_doc <- AnnotatedPlainTextDocument(text, text_annotations)
# Extract entities from an AnnotatedPlainTextDocument
# Note: newer releases of the NLP package provide annotation(text) in place of
# annotations(text)[[1]]; see the comment below if the accessor is not found.
entities <- function(text, kind) {
  s <- text$content
  a <- annotations(text)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
entities(text_doc, kind = "person")
For those looking: the entities function no longer works because annotations() was deprecated. I found the solution in a YouTube comment, so I thought I'd cross-post it for anyone searching. You will need to install and load the coreNLP library, change annotations to annotation, and remove the [[1]] brackets after the document. The function should then recognize and output the identified entities when it is run with a kind specified.
install.packages("coreNLP")
library(coreNLP)
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)
  if(hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}
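To see what the revised function returns without running the Java models, here is a minimal sketch against a hand-built annotation. It assumes a release of the NLP package that provides annotation(), the accessor the fix above relies on. The sentence, the character offsets, and the entity labels are made up for illustration and are not openNLP output.

```r
library(NLP)

# Illustrative text; offsets below are 1-based, inclusive character positions.
s <- as.String("Catherine Gitau works in Nairobi")

# Hand-built annotation (an assumption for this sketch, not model output):
# one sentence span plus two entity spans carrying a "kind" feature.
a <- Annotation(
  id = 1:3,
  type = c("sentence", "entity", "entity"),
  start = c(1L, 1L, 26L),
  end = c(32L, 15L, 32L),
  features = list(list(), list(kind = "person"), list(kind = "location"))
)
doc <- AnnotatedPlainTextDocument(s, a)

# Same function as in the comment above.
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotation(doc)
  if (hasArg(kind)) {
    k <- sapply(a$features, `[[`, "kind")
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
}

print(entities(doc, kind = "person"))        # spans labelled "person"
print(entities(doc, kind = "organization"))  # no span carries this kind
print(entities(doc))                         # every span of type "entity"
```

When no annotation carries the requested kind, as with "organization" here, the subset is empty and the call should come back as a zero-length character vector, so an empty result can simply mean that the annotator did not tag anything of that kind in the text.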
Hi Catherine, thank you for the code you presented above; it helps a lot. However, when I run the code, I get the following results:
The annotation step itself seems to have worked well; see the following example results:
What do you think causes the final result to be character(0)?
I am working with Twitter data and would like to extract the place names and nouns mentioned in the tweets so that I can geocode the places.
Thank you for helping.