I'm playing around with NER (Named Entity Recognition) and the basic idea is that I can pass in multiple paragraphs and get recognized entities in a nicely formatted dictionary of lists.
I might look into running the Java servlet that Stanford made to improve performance.
{
'organizations': ['Wall', 'Street', 'Journal', 'Apple', 'Inc.', 'Apple', 'TV', 'Apple', 'Mac', 'App', 'Store', 'Apple', 'Computer', ',', 'Inc.', 'Apple', 'Inc.', 'National', 'Hockey', 'League', 'Montreal', 'Canadiens', 'Stanley', 'Cups', 'Toronto', 'Blue', 'Jays'],
'locations': ['France', 'Cupertino', 'California'],
'persons': ['Christine', 'Lagarde', 'Steve', 'Jobs', 'Steve', 'Wozniak', 'Ronald', 'Wayne', 'Samuel', 'Patterson', 'Smyth', 'Sam', "''", 'Pollock', 'Pollock']
}
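For reference, here is a minimal sketch of the grouping step that produces a dictionary like the one above. It assumes the tagger returns a list of (token, tag) pairs using the tags PERSON, ORGANIZATION, LOCATION and O, as Stanford's 3-class model does. `group_entities` and `TAG_TO_KEY` are hypothetical names for illustration, not part of NLTK, and the sample input is hand-written rather than real tagger output.

```python
# Assumed tag names (Stanford NER 3-class model), mapped to the
# dictionary keys used in the output above.
TAG_TO_KEY = {
    "ORGANIZATION": "organizations",
    "LOCATION": "locations",
    "PERSON": "persons",
}


def group_entities(tagged_tokens):
    """Collect (token, tag) pairs into a dict of lists keyed by entity type."""
    entities = {key: [] for key in TAG_TO_KEY.values()}
    for token, tag in tagged_tokens:
        key = TAG_TO_KEY.get(tag)
        if key is not None:  # skip "O" (non-entity) and any unknown tag
            entities[key].append(token)
    return entities


# Hand-written sample in the (token, tag) shape assumed above.
sample = [
    ("Steve", "PERSON"),
    ("Jobs", "PERSON"),
    ("founded", "O"),
    ("Apple", "ORGANIZATION"),
    ("in", "O"),
    ("Cupertino", "LOCATION"),
]

print(group_entities(sample))
```

Because each token is appended individually, multi-word names such as "Steve Jobs" end up as separate list items, which matches the shape of the output shown above.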
- Python: 3.5.1
- NLTK: 3.2.1
- Stanford NER: 3.6.0
NOTE: After installing NLTK, you have to run nltk.download() to get additional dependencies.
I have a project in which important data, such as organizations and dates, needs to be extracted from a text file. How should I proceed with this?