Let p(x) be the probability mass function of a random variable X over a discrete set of symbols:
p(x) = P(X = x)
For example, if we toss two fair coins and count the number of heads, we get a random variable with p(0) = 1/4, p(1) = 1/2 and p(2) = 1/4.
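The two-coin example can be checked by brute-force enumeration of the equally likely outcomes:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Enumerate all equally likely outcomes of tossing two fair coins.
outcomes = list(product('HT', repeat=2))
counts = Counter(o.count('H') for o in outcomes)

# p(x) = (# outcomes with x heads) / (# outcomes)
p = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}
# p == {0: 1/4, 1: 1/2, 2: 1/4}
```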
| """Query AlchemyAPI to determine number of API calls still available""" | |
| # -*- coding: utf-8 -*- | |
| import json | |
| import requests | |
| def get_api_key(): | |
| # Load API key (40 HEX character key) from local file | |
| key = open('api_key.txt').readline().strip() | |
| return key |
import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z])(\.[A-Z])+\.?        # abbreviations, e.g. U.S.A.
'''
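The abbreviation branch of this verbose pattern can be exercised on its own; note that the capturing groups above make `re.findall` return groups rather than whole matches, so a non-capturing variant is used in this sketch:

```python
import re

# Non-capturing version of the abbreviation branch, e.g. matching "U.S.A."
abbrev_re = r'(?x) (?:[A-Z]\.)+'

matches = re.findall(abbrev_re, 'They moved to the U.S.A. last year.')
# → ['U.S.A.']
```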
| """ | |
| Programming task | |
| ================ | |
| Implement the method iter_sample below to make the Unit test pass. iter_sample | |
| is supposed to peek at the first n elements of an iterator, and determine the | |
| minimum and maximum values (using their comparison operators) found in that | |
| sample. To make it more interesting, the method is supposed to return an | |
| iterator which will return the same exact elements that the original one would | |
| have yielded, i.e. the first n elements can't be missing. |
| """ | |
"""
Programming task
================
The following is an implementation of simple Named Entity Recognition (NER).
NER is concerned with identifying place names, people names or other special
identifiers in text.
Here we make a very simple definition of a named entity: a sequence of
at least two consecutive capitalized words. E.g. "Los Angeles" is a named
entity.
"""
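A minimal sketch matching that simple definition with a regular expression (this only handles "Xxxx Xxxx"-shaped words, an assumption that ignores hyphens, initials and accented letters):

```python
import re

# Two or more consecutive capitalized words count as one named entity.
named_entity_re = re.compile(r'\b(?:[A-Z][a-z]+ )+[A-Z][a-z]+\b')

text = 'He flew from Los Angeles to New York City last May.'
entities = named_entity_re.findall(text)
# → ['Los Angeles', 'New York City']
```

Note that a lone capitalized word like "May" is deliberately not matched, since the definition requires at least two consecutive capitalized words.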
| """ | |
| This is a script used to clean control characters from the | |
| - NTU -Multilingual Corpus (http://web.mysites.ntu.edu.sg/fcbond/open/pubs/2012-ijalp-ntumc.pdf) | |
| - SeedLing Corpus (http://www.aclweb.org/anthology/W/W14/W14-2211.pdf) | |
| - DSL Corpus Collection (https://comparable.limsi.fr/bucc2014/4.pdf) | |
| """ | |
| import re | |
| import unicodedata | |
| # A full list of unicode characters. |
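The script's code isn't shown beyond the imports, but the cleaning step it describes can be sketched with `unicodedata.category`: characters whose category starts with "C" (control, format, surrogate, private use, unassigned) are dropped. Keeping tabs and newlines is an assumption; a corpus cleaner would likely want to preserve line structure.

```python
import unicodedata

def remove_control_chars(text, keep='\t\n'):
    """Drop characters in the Unicode "C*" categories, except those in keep."""
    return ''.join(ch for ch in text
                   if ch in keep or not unicodedata.category(ch).startswith('C'))

cleaned = remove_control_chars('foo\x00bar\u200bbaz\n')
# → 'foobarbaz\n'  (NUL and ZERO WIDTH SPACE removed, newline kept)
```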
With NLTK version 3.1 and the Stanford NER tool (2015-12-09 release), it is possible to hack StanfordNERTagger._stanford_jar to include the other .jar files that the new tagger needs.
First, set up the environment variables as instructed at https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software
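A sketch of the classpath hack. The install directory and jar file names below are assumptions; adjust them to your local Stanford NER 2015-12-09 layout:

```python
import os

# Hypothetical install location and extra jars the new tagger may need.
stanford_dir = '/usr/local/stanford-ner-2015-12-09'
extra_jars = [os.path.join(stanford_dir, 'lib', jar)
              for jar in ('joda-time.jar', 'jollyday.jar')]

def extend_classpath(stanford_jar, jars):
    """Join the tagger's existing jar path with extra jars as a Java classpath."""
    return os.pathsep.join([stanford_jar] + list(jars))

# After constructing the tagger, one would patch the private attribute:
# st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
# st._stanford_jar = extend_classpath(st._stanford_jar, extra_jars)
```

Since `_stanford_jar` is a private attribute, this hack may break in later NLTK versions.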
# -*- coding: utf-8 -*-
"""BLEU.

Usage:
    bleu.py --reference FILE --translation FILE [--weights STR] [--smooth STR] [--smooth-epsilon STR] [--smooth-alpha STR] [--smooth-k STR] [--segment-level]
    bleu.py -r FILE -t FILE [-w STR] [--smooth STR] [--segment-level]

Options:
    -h --help     Show this screen.
"""
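The script's implementation isn't shown, but the metric it computes can be sketched in pure Python. This is an unsmoothed, single-reference sentence-level BLEU (function and variable names are illustrative, not the script's own):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, weights=(0.25, 0.25, 0.25, 0.25)):
    """Unsmoothed sentence-level BLEU against a single reference."""
    precisions = []
    for n, _ in enumerate(weights, start=1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped n-gram matches: each hypothesis n-gram counts at most
        # as often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # some n-gram order has no overlap; smoothing would fix this
    score = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))  # brevity penalty
    return bp * score
```

The `--smooth` options in the usage string correspond to the smoothing methods that avoid the hard zero when some n-gram order has no matches.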
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import PerceptronTagger
#from nltk import pos_tag, word_tokenize

# Pywsd's Lemmatizer.
porter = PorterStemmer()
wnl = WordNetLemmatizer()
Firstly, I strongly believe that if you're working with NLP/ML/AI-related tools, getting things to work on Linux or Mac OS is much easier and saves you quite a lot of time.
Disclaimer: I am not affiliated with Continuum (conda), Git, Java, Windows OS, Stanford NLP or the MaltParser group. The steps presented below are how I, IMHO, would set up a Windows computer if I owned one.
Please, please, please understand the solution, don't just copy and paste it! We're not monkeys typing Shakespeare ;P