@sujitpal
Created April 24, 2016 17:15
JUnit/OpenNLP code to extract Noun Phrases from text. Originally from my pastebin (http://pastebin.com/bUDY7fb0).
// Imports needed by the test method below (the method itself belongs inside a JUnit test class).
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.junit.Test;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

@Test
public void testNounPhraseExtractionStandalone() throws Exception {
    SentenceDetectorME sentenceDetector;
    TokenizerME tokenizer;
    POSTaggerME posTagger;
    ChunkerME chunker;
    InputStream smis = null;
    InputStream tmis = null;
    InputStream pmis = null;
    InputStream cmis = null;
    try {
        // Load the four pre-trained English models: sentence detector, tokenizer, POS tagger, chunker.
        smis = new FileInputStream(new File("/Users/sujit/Projects/tgni/src/main/resources/models/en_sent.bin"));
        tmis = new FileInputStream(new File("/Users/sujit/Projects/tgni/src/main/resources/models/en_token.bin"));
        pmis = new FileInputStream(new File("/Users/sujit/Projects/tgni/src/main/resources/models/en_pos_maxent.bin"));
        cmis = new FileInputStream(new File("/Users/sujit/Projects/tgni/src/main/resources/models/en_chunker.bin"));
        SentenceModel smodel = new SentenceModel(smis);
        sentenceDetector = new SentenceDetectorME(smodel);
        TokenizerModel tmodel = new TokenizerModel(tmis);
        tokenizer = new TokenizerME(tmodel);
        POSModel pmodel = new POSModel(pmis);
        posTagger = new POSTaggerME(pmodel);
        ChunkerModel cmodel = new ChunkerModel(cmis);
        chunker = new ChunkerME(cmodel);
    } finally {
        // The models are fully loaded by now, so the streams can be closed.
        IOUtils.closeQuietly(cmis);
        IOUtils.closeQuietly(pmis);
        IOUtils.closeQuietly(tmis);
        IOUtils.closeQuietly(smis);
    }
    String text = "This article provides a review of the literature on clinical correlates of awareness in dementia. Most inconsistencies were found with regard to an association between depression and higher levels of awareness. Dysthymia, but not major depression, is probably related to higher levels of awareness. Anxiety also appears to be related to higher levels of awareness. Apathy and psychosis are frequently present in patients with less awareness, and may share common neuropathological substrates with awareness. Furthermore, unawareness seems to be related to difficulties in daily life functioning, increased caregiver burden, and deterioration in global dementia severity. Factors that may be of influence on the inconclusive data are discussed, as are future directions of research.";
    // Sentence spans are character offsets into the full text.
    Span[] sentSpans = sentenceDetector.sentPosDetect(text);
    for (Span sentSpan : sentSpans) {
        String sentence = sentSpan.getCoveredText(text).toString();
        int start = sentSpan.getStart();
        // Token spans are character offsets relative to the sentence, not the full text.
        Span[] tokSpans = tokenizer.tokenizePos(sentence);
        String[] tokens = new String[tokSpans.length];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
        }
        String[] tags = posTagger.tag(tokens);
        // Chunk spans are token indices (end exclusive), each typed with a phrase label such as NP or VP.
        Span[] chunks = chunker.chunkAsSpans(tokens, tags);
        for (Span chunk : chunks) {
            if ("NP".equals(chunk.getType())) {
                // Map the chunk's first and last token back to character offsets in the full text.
                int npstart = start + tokSpans[chunk.getStart()].getStart();
                int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
                System.out.println(text.substring(npstart, npend));
            }
        }
    }
}
This produces the following noun phrases:
[junit] This article
[junit] a review
[junit] the literature
[junit] clinical correlates
[junit] awareness
[junit] dementia
[junit] Most inconsistencies
[junit] regard
[junit] an association
[junit] depression and higher levels
[junit] awareness
[junit] Dysthymia
[junit] not major depression
[junit] higher levels
[junit] awareness
[junit] Anxiety
[junit] higher levels
[junit] awareness
[junit] psychosis
[junit] patients
[junit] less awareness
[junit] common neuropathological substrates
[junit] awareness
[junit] unawareness
[junit] difficulties
[junit] daily life functioning
[junit] caregiver burden
[junit] deterioration
[junit] global dementia severity
[junit] Factors
[junit] that
[junit] influence
[junit] the inconclusive data
[junit] future directions
[junit] research
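
The offset arithmetic in the inner loop is the easiest part to get wrong, so here is a minimal, dependency-free sketch of just that step. It does not use OpenNLP: `ChunkOffsetDemo`, `tokenOffsets`, and `chunkText` are hypothetical names, and the whitespace tokenizer only stands in for `TokenizerME.tokenizePos`. The one assumption carried over from the test is the index convention: token offsets are relative to the sentence, and the chunk is a token-index range with an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkOffsetDemo {

    // Stand-in for a real tokenizer: splits on whitespace and returns
    // {start, end} character offsets (end exclusive) relative to the sentence.
    public static List<int[]> tokenOffsets(String sentence) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            while (i < sentence.length() && Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            int begin = i;
            while (i < sentence.length() && !Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new int[] {begin, i});
            }
        }
        return spans;
    }

    // Maps a chunk given as token indices [chunkStart, chunkEnd) back to a
    // substring of the full text. sentStart is the sentence's character offset
    // in the full text; tokSpans are offsets relative to the sentence.
    public static String chunkText(String text, int sentStart, List<int[]> tokSpans,
                                   int chunkStart, int chunkEnd) {
        int begin = sentStart + tokSpans.get(chunkStart)[0];
        int end = sentStart + tokSpans.get(chunkEnd - 1)[1]; // last token is chunkEnd - 1
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "Anxiety is common. Apathy affects daily life functioning.";
        int sentStart = text.indexOf("Apathy");
        String sentence = text.substring(sentStart);
        List<int[]> tokSpans = tokenOffsets(sentence);
        // Tokens 2 and 3 of the second sentence are "daily" and "life".
        System.out.println(chunkText(text, sentStart, tokSpans, 2, 4)); // prints "daily life"
    }
}
```

Taking the substring from the original text, rather than joining the tokens with spaces, preserves the original spacing and punctuation inside each phrase.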