@tgalery
tgalery / nnmf_no_datatreatment.py
Created August 15, 2012 02:04
Non-Negative Matrix Factorisation solutions to topic extraction in python

These are two solutions to a topic extraction task. The script loads the sample data into a variable. I’ve included running times for both solutions, so that the cost of each can be compared alongside its results. According to (Pazienza et al. 2005), two trends in the treatment of textual information can be identified: one based on linguistic and syntactic information, another based on statistical analysis of frequency patterns (which usually treats text as a bag of words). Whilst the first approach is a purely syntactic one, the second one aims to incorporate information about syntactic categories into the analysis (hence a hybrid approach).
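As a rough illustration of the statistical, bag-of-words side of this, here is a minimal sketch of topic extraction by non-negative matrix factorisation with a running time attached. It is not the code from the gist: the sample documents, the helper names (`bag_of_words`, `matmul`, `transpose`, `nmf`) and the choice of Lee and Seung's multiplicative updates are all assumptions made for the example, and it uses only the standard library so that it runs on Python 2.7 as well as Python 3.

```python
import random
import re
import time
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row per document."""
    tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted(set(w for toks in tokenised for w in toks))
    counts = [Counter(toks) for toks in tokenised]
    return vocab, [[float(c[w]) for w in vocab] for c in counts]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorise v (docs x terms) into w (docs x k) and h (k x terms).

    Uses multiplicative updates for the squared Frobenius error, which
    keep every entry of w and h non-negative.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical sample data, standing in for the data loaded by the script.
docs = [
    "the cat sat on the mat with another cat",
    "my cat chased the dog around the garden",
    "stock markets fell as bank shares dropped",
    "the bank raised rates and markets reacted",
]

start = time.time()
vocab, v = bag_of_words(docs)
w, h = nmf(v, k=2)
elapsed = time.time() - start

# Each row of h weights the vocabulary for one topic; show its top terms.
for topic in h:
    top = sorted(zip(topic, vocab), reverse=True)[:3]
    print([word for _, word in top])
print("elapsed: %.3f s" % elapsed)
```

The rows of `h` can be read as topics and the rows of `w` as the mix of topics in each document. No stop-word removal or other data treatment is applied here, so frequent function words such as "the" may dominate a topic.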

After presenting the solutions and briefly mentioning an alternative to them, I’ll move to a short theoretical discussion.

1 – Set-up used:

*Ubuntu 11.04 Natty AMD64

*Python 2.7.3

*python re library

*python nltk 2.0 library and the required NumPy and PyYaml (For NLP tas

@tgalery
tgalery / twiterwars2.py
Created August 12, 2012 17:09
Python spell-checker for a Twitter stream

This is a simple Python program that streams tweets from two locations (London and Exeter in our example) and compares them to see which has the greater number of spelling mistakes.
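The comparison step can be sketched as follows. This is only an illustration under stated assumptions, not the gist's code: the tweets are hard-coded instead of streamed, `KNOWN_WORDS` is a tiny hypothetical stand-in for a real word list, and the function names `misspelling_counts` and `worst_speller` are made up for the example.

```python
import re

# Hypothetical word list; a real run would use a full dictionary.
KNOWN_WORDS = {
    "the", "weather", "in", "london", "exeter", "is", "lovely", "today",
    "going", "to", "shops", "i", "love", "this", "city",
}


def misspelling_counts(tweets, known_words):
    """Return (misspelt tokens, total tokens) for a list of tweets."""
    misspelt = total = 0
    for tweet in tweets:
        # Drop URLs, @mentions and #hashtags, which are not dictionary words.
        text = re.sub(r"(https?://\S+|[@#]\w+)", " ", tweet.lower())
        for token in re.findall(r"[a-z]+(?:'[a-z]+)?", text):
            total += 1
            if token not in known_words:
                misspelt += 1
    return misspelt, total


def worst_speller(samples, known_words):
    """Return (place with the highest misspelling rate, rate per place)."""
    rates = {}
    for place, tweets in samples.items():
        misspelt, total = misspelling_counts(tweets, known_words)
        rates[place] = misspelt / float(total) if total else 0.0
    return max(rates, key=rates.get), rates


# Hypothetical tweets standing in for the two location streams.
samples = {
    "London": [
        "The wether in London is lovley today",
        "@bob going to the shops #sales http://example.com/x",
    ],
    "Exeter": [
        "I love this city",
        "The weather in Exeter is lovely todya",
    ],
}

place, rates = worst_speller(samples, KNOWN_WORDS)
print(place, rates)
```

The sketch compares misspelling rates and not raw counts, so that a location that simply produces more tweets is not penalised; comparing raw counts would only need the first element of each tuple returned by `misspelling_counts`.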

1 – Set-up used:

*Ubuntu 11.04 Natty AMD64
*Python 2.7.3
*python re library
*python nltk 2.0 library and the required NumPy and PyYaml (For NLP tasks)
*python tweeterstream 1.1.1 library (For Twitter manipulation)