@mattb
mattb / gist:3888345
Created October 14, 2012 11:53
Some pointers for Natural Language Processing / Machine Learning

Here are the areas I've been researching, some things I've read and some open source packages...

Nearly all text processing starts by transforming text into vectors: http://en.wikipedia.org/wiki/Vector_space_model
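As a rough illustration of that transformation, here's a minimal bag-of-words sketch using only the Python standard library. The whitespace tokenizer, the function names (`tokenize`, `build_vocabulary`, `to_vector`) and the toy documents are all assumptions made for the example, not something from a particular package.

```python
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_vocabulary(docs):
    # Give every distinct token a fixed column index, in sorted order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def to_vector(doc, vocab):
    # One dimension per vocabulary term; the value is the raw term count.
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec

# Toy corpus, invented for illustration.
docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Each document ends up as a list of counts with the same length as the vocabulary, so documents can be compared with ordinary vector operations. Note that word order is thrown away at this step.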

These vectors are often reweighted with a transform such as TF-IDF, which normalises the data and controls for outliers (words that are too frequent or too rare confuse the algorithms): http://en.wikipedia.org/wiki/Tf%E2%80%93idf
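To make the weighting concrete, here's a small sketch of one common TF-IDF variant: term frequency as count divided by document length, and inverse document frequency as the natural log of (number of documents / documents containing the term). Other variants exist with different smoothing, so treat the exact formula, the `tf_idf` function name and the toy documents as assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    # Naive whitespace tokenization (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        # tf = count / document length; idf = ln(N / df), unsmoothed.
        weights.append({
            tok: (n / total) * math.log(n_docs / df[tok])
            for tok, n in counts.items()
        })
    return weights

# Toy corpus, invented for illustration.
docs = ["the cat sat", "the dog sat down", "the cat chased the dog"]
weights = tf_idf(docs)
```

With this variant a word that appears in every document (here "the") gets a weight of zero, while a word confined to one document (here "chased") gets the largest boost relative to its frequency.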

Collocation detection finds cases where two or more words occur together more often than you'd expect from how often they occur separately (e.g. "wishy-washy" in English). I use this to group words into n-gram tokens, because many NLP techniques treat each word as if it's independent of all the others in a document, ignoring order: http://matpalm.com/blog/2011/10/22/collocations_1/
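One simple way to score candidate pairs is pointwise mutual information (PMI), which compares how often a bigram occurs with how often its words occur on their own. The sketch below is only an illustration of that idea and is not necessarily the scoring method used in the linked post; the function names (`bigram_pmi`, `merge_collocations`), the `min_count` cutoff and the toy sentence are all assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    # Score adjacent word pairs by PMI: log2(P(a, b) / (P(a) * P(b))).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), n in bigrams.items():
        if n < min_count:
            # Rare pairs get unreliably high PMI, so skip them.
            continue
        p_ab = n / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

def merge_collocations(tokens, pairs):
    # Rewrite the token stream so each chosen pair becomes one n-gram token.
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Toy sentence, invented for illustration.
tokens = "i love new york and she loves new york but the new car is old".split()
scores = bigram_pmi(tokens)
merged = merge_collocations(tokens, set(scores))
```

Here "new york" is the only pair that repeats, so it is merged into a single `new_york` token, while the standalone "new" in "the new car" is left alone. The merged stream can then be fed into the vectorizing step in place of the raw words.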