@mattb /gist:3888345

Some pointers for Natural Language Processing / Machine Learning

Here are the areas I've been researching, some things I've read and some open source packages...

Nearly all text processing starts by transforming text into vectors:
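
As a rough illustration of that first step, here is a minimal bag-of-words sketch using only the Python standard library. The function names (`tokenize`, `build_vocabulary`, `vectorize`) and the regex tokenizer are my own assumptions for the example, not taken from any of the packages listed further down.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a column index (sorted so the order is stable)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a list of token counts; tokens not in the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(doc, vocabulary) for doc in documents]
print(vectors)
```

Each document becomes one row of counts with one column per vocabulary word, which is the shape most of the later techniques expect as input.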

This step often uses a transform such as TF-IDF to normalise the data and control for outliers (words that are too frequent or too rare confuse the algorithms):
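
To make the weighting concrete, here is a small sketch of one common TF-IDF variant (term frequency times the log of inverse document frequency). The function name `tf_idf` and the exact formula are assumptions for illustration; libraries differ in smoothing and normalisation. Note that this variant only down-weights very frequent words. Very rare words are usually handled separately, for example with a minimum document-frequency cut-off.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / docs containing term)
    """
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
for weights in tf_idf(docs):
    print(weights)
```

A word such as "the" that appears in every document gets weight zero, while a word that appears in only one document gets the largest weight.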

Collocation detection is a technique for finding two or more words that occur together more often than they occur separately (e.g. "wishy-washy" in English). I use it to group words into n-gram tokens, because many NLP techniques treat each word as independent of all the others in a document and ignore word order:
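
One simple way to score candidate collocations is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is an assumption-laden illustration, not the method of any particular package: the names `bigram_pmi` and `merge_collocations`, the `min_count` cut-off and the underscore joiner are all hypothetical choices.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score each adjacent word pair seen at least min_count times by PMI.

    PMI = log2( P(a, b) / (P(a) * P(b)) ); higher means the pair occurs
    together more often than its individual word frequencies would suggest.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # pairs seen only rarely give unreliable PMI estimates
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_unigrams
        p_b = unigrams[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_collocations(tokens, collocations):
    """Join each listed adjacent pair into a single token such as 'new_york'."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "new york is big and new york is old but the city is new".split()
print(bigram_pmi(tokens))
print(merge_collocations(tokens, {("new", "york")}))
```

After merging, downstream steps that ignore word order still see "new_york" as a single unit.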

When you've got a lot of text and you don't know what the patterns in it are, you can run an "unsupervised" clustering using Latent Dirichlet Allocation (LDA):
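
To show roughly what LDA does under the hood, here is a toy collapsed Gibbs sampler in plain Python. It is a sketch for intuition only, and the function name `lda_gibbs` and the default hyperparameters (`alpha`, `beta`, `n_iter`) are assumptions of mine. For real data, use one of the packages listed below.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over documents given as token lists.

    Returns (doc_topic_counts, topic_word_counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab


docs = [
    "cat dog pet cat vet".split(),
    "dog pet vet dog".split(),
    "stock market trade stock".split(),
    "market trade bank stock".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print("topic", k, [word for _, word in top])
```

The per-document topic counts say which topics each document mixes, and the per-topic word counts say which words characterise each topic.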

Or, if you already know how your data is divided into topics (this is known as "labeled data"), you can use "supervised" techniques, such as training a classifier to predict the labels of new, similar data. I can't find a really good page on this. I picked up a lot over IM from my friend Ben, who is writing a book that comes out next year:
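
As a small illustration of the supervised case, here is a multinomial Naive Bayes text classifier in plain Python. Naive Bayes is my choice for the example because it is short; the notes above don't name a specific classifier. The class name `NaiveBayes` and the toy spam/ham data are made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior for the label, plus a log likelihood per known word.
            score = math.log(n_label / n_docs)
            for w in tokens:
                if w not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    ["cheap", "pills", "buy", "now"],
    ["buy", "cheap", "watches"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "tomorrow"],
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["cheap", "watches", "now"]))
print(model.predict(["lunch", "meeting"]))
```

The labeled examples are used once in `fit`; after that, `predict` assigns a label to new token lists it has never seen.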

Here are the tools I've mostly been using:

Vowpal Wabbit (classification and LDA, poor documentation, C++, high performance):

Gensim (LDA, vector similarity, text processing, Python):

Mallet (classification and LDA, Java):

Lingpipe (text analysis, clustering, classification, linguistics, Java, commercial open-source):

Mahout (Hadoop, classification, clustering, LDA, collaborative filtering, Java):

Langdetect (language detection, Java):

Some blogs I like:

MetaOptimize Q+A is the Stack Overflow of ML:

The Mahout In Action book is quite good and practical:
