Let p(x) be the probability mass function of a random variable X over a discrete set of symbols X:
p(x) = P(X = x)
For example, if we toss two fair coins and count the number of heads, we have a random variable with p(0) = 1/4, p(1) = 1/2 and p(2) = 1/4.
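This two-coin example can be checked by enumerating the sample space; a small sketch using exact fractions rather than floats:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the four equally likely outcomes of two fair coin tosses
outcomes = list(product('HT', repeat=2))
# X = number of heads; count how often each value of X occurs
counts = Counter(outcome.count('H') for outcome in outcomes)
# p(x) = P(X = x)
pmf = {x: Fraction(n, len(outcomes)) for x, n in counts.items()}
# pmf == {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```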
"""Query AlchemyAPI to determine number of API calls still available""" | |
# -*- coding: utf-8 -*- | |
import json | |
import requests | |
def get_api_key(): | |
# Load API key (40 HEX character key) from local file | |
key = open('api_key.txt').readline().strip() | |
return key |
import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)      # set flag to allow verbose regexps
    ([A-Z])(\.[A-Z])+\.?    # abbreviations, e.g. U.S.A.
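The pattern above is cut off at the abbreviation branch. A hypothetical completion of such a verbose tokenizer regex (the extra branches are my own illustration, not the original's, and I use non-capturing groups so `re.findall` returns whole tokens):

```python
import re

# Hypothetical completion of a verbose word-tokenizing pattern
sentence_re = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*              # words with optional internal hyphens
  | [.,;"'?():_`-]            # standalone punctuation tokens
'''
tokens = re.findall(sentence_re, "The U.S.A. sells cycle-transmissions.")
# tokens == ['The', 'U.S.A.', 'sells', 'cycle-transmissions', '.']
```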
""" | |
Programming task | |
================ | |
Implement the method iter_sample below to make the Unit test pass. iter_sample | |
is supposed to peek at the first n elements of an iterator, and determine the | |
minimum and maximum values (using their comparison operators) found in that | |
sample. To make it more interesting, the method is supposed to return an | |
iterator which will return the same exact elements that the original one would | |
have yielded, i.e. the first n elements can't be missing. |
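The task's exact signature isn't shown in this excerpt; one plausible sketch buffers the peeked elements and chains them back in front of the remainder:

```python
import itertools

def iter_sample(iterator, n):
    """Peek at the first n elements of `iterator` and return
    (minimum, maximum, restored_iterator), where the restored
    iterator still yields all of the original elements."""
    sample = list(itertools.islice(iterator, n))
    # Chain the consumed sample back in front of the untouched remainder
    return min(sample), max(sample), itertools.chain(sample, iterator)

it = iter([5, 1, 9, 3, 7])
lo, hi, restored = iter_sample(it, 3)
# lo == 1, hi == 9, and list(restored) == [5, 1, 9, 3, 7]
```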
""" | |
Programming task | |
================ | |
The following is an implementation of a simple Named Entity Recognition (NER). | |
NER is concerned with identifying place names, people names or other special | |
identifiers in text. | |
Here we make a very simple definition of a named entity: A sequence of | |
at least two consecutive capitalized words. E.g. "Los Angeles" is a named |
""" | |
This is a script used to clean control characters from the
- NTU Multilingual Corpus (http://web.mysites.ntu.edu.sg/fcbond/open/pubs/2012-ijalp-ntumc.pdf)
- SeedLing Corpus (http://www.aclweb.org/anthology/W/W14/W14-2211.pdf)
- DSL Corpus Collection (https://comparable.limsi.fr/bucc2014/4.pdf)
"""
import re
import unicodedata
# A full list of Unicode characters.
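The cleaning logic itself is truncated above. One common approach, sketched here as an assumption rather than the script's actual code, filters on Unicode general categories:

```python
import unicodedata

def remove_control_chars(text):
    # Keep ordinary whitespace, drop characters whose Unicode general
    # category starts with 'C' (Cc = control, Cf = format, etc.)
    return ''.join(ch for ch in text
                   if ch in '\t\n\r'
                   or not unicodedata.category(ch).startswith('C'))

cleaned = remove_control_chars('ab\x00c\u200bd\ne')
# cleaned == 'abcd\ne'  (NUL and zero-width space removed, newline kept)
```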
With NLTK version 3.1 and the Stanford NER tool (2015-12-09 release), it is possible to hack StanfordNERTagger._stanford_jar
to include the other .jar
files that are necessary for the new tagger.
First, set up the environment variables as instructed at https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software
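The hack amounts to extending the classpath string stored in `_stanford_jar`. The jar paths below are hypothetical placeholders; a sketch of building the combined classpath:

```python
import os

# Hypothetical locations -- substitute the paths on your own machine
stanford_jar = '/usr/local/stanford-ner-2015-12-09/stanford-ner.jar'
extra_jars = [
    '/usr/local/stanford-ner-2015-12-09/lib/joda-time.jar',
    '/usr/local/stanford-ner-2015-12-09/lib/jollyday.jar',
]

# _stanford_jar is the classpath string the tagger hands to Java, so
# extra jars can be appended with the platform's path separator
classpath = os.pathsep.join([stanford_jar] + extra_jars)
```

After constructing the tagger, the idea is to assign this string back, e.g. `tagger._stanford_jar = classpath`; note this pokes at a private attribute, so it may break in later NLTK versions.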
# -*- coding: utf-8 -*-
"""BLEU.
Usage:
    bleu.py --reference FILE --translation FILE [--weights STR] [--smooth STR] [--smooth-epsilon STR] [--smooth-alpha STR] [--smooth-k STR] [--segment-level]
    bleu.py -r FILE -t FILE [-w STR] [--smooth STR] [--segment-level]
Options:
    -h --help    Show this screen.
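The script's body is not shown here. As a reference point, a minimal sentence-level BLEU with uniform weights and no smoothing (my sketch, not necessarily this script's implementation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    # Modified n-gram precisions: hypothesis counts clipped by reference counts
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    # Geometric mean of the precisions, uniform weights
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty for hypotheses shorter than the reference
    bp = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_avg)

ref = "the cat is on the mat".split()
# A perfect match scores 1.0
score = sentence_bleu(ref, ref)
```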
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import PerceptronTagger
#from nltk import pos_tag, word_tokenize

# Pywsd's Lemmatizer.
porter = PorterStemmer()
wnl = WordNetLemmatizer()
Firstly, I strongly think that if you're working with NLP/ML/AI-related tools, getting things to work on Linux and Mac OS is much easier and saves you quite a lot of time.
Disclaimer: I am not affiliated with Continuum (conda), Git, Java, Windows OS, Stanford NLP or the MaltParser group. The steps presented below are how I, IMHO, would set up a Windows computer if I owned one.
Please, please, please understand the solution, don't just copy and paste!!! We're not monkeys typing Shakespeare ;P