
@tgalery
Created August 15, 2012 02:04
Non-Negative Matrix Factorisation solutions to topic extraction in python

These are two solutions to a topic extraction task. The sample data is loaded into a variable by each script. I've included running times for both solutions, so we have precise information about the cost of each one in addition to its results. According to Pazienza et al. (2005), two trends in textual information extraction can be identified: one based on linguistic and syntactic information, the other based on statistical analysis of frequency patterns (which usually treats a text as a bag of words). Whilst the first solution presented here is a purely statistical one, the second incorporates information about syntactic categories into the analysis (hence a hybrid approach).

After presenting the solutions and briefly mentioning an alternative to them, I'll move on to a short theoretical discussion.

1 – Set-up used:

*Ubuntu 11.04 Natty AMD64

*Python 2.7.3

*python re library

*python nltk 2.0 library and the required NumPy and PyYaml (For NLP tasks)

*python tweeterstream 1.1.1 library (For Twitter manipulation)

*python simplejson library (For JSON manipulation)

*python sklearn 0.11 library (For machine learning tasks)

*python time module for benchmarking different solutions

Installation Instructions:

  • Python and python installation packages: from command prompt run:

  • sudo apt-get install python python-pip python-setuptools

  • NLTK, NumPy, PyYaml library: from command prompt run
    sudo pip install -U numpy
    sudo pip install -U pyyaml nltk
    Test installation: run python then type import nltk

  • Sklearn 0.11: from command prompt run:

    For the dependencies: sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++ python-matplotlib python-pip

    sudo pip install -U scikit-learn
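    Test installation: from the command prompt, the following one-liner simply prints the installed scikit-learn version (assuming the system Python 2.7 listed above):

    python -c "import sklearn; print sklearn.__version__"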

Theoretical assumptions:

The data manipulated is a JSON file containing articles with a variety of fields. Since the solutions implemented here are based on machine learning techniques (involving the frequency distribution of tokens in a bag of words), I only used the title and the content of each article. The title was duplicated, so that the frequency of the words contained in it would be increased (this is a simple way of capturing the intuition that words in the title are a bit more important). A sketch of the expected file structure is given below.
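For reference, the scripts only assume a top-level "articles" list whose items have "title" and "content" fields (any other fields are ignored). Here is a hypothetical snippet that writes a toy data.json with that structure, using placeholder strings rather than real data, which can be used to smoke-test the scripts:

import simplejson as json

# Hypothetical toy file: the field names ("articles", "title", "content") match
# what the scripts read; the values are placeholders, not real articles.
sample = {
    "articles": [
        {"title": "First placeholder title",
         "content": "Placeholder body text for the first article."},
        {"title": "Second placeholder title",
         "content": "Placeholder body text for the second article."}
    ]
}
with open('data.json', 'w') as f:
    json.dump(sample, f)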

Running instructions:

Simply run either nnmf_no_datatreatment.py or nnmf_noun_extraction.py; each script displays its results and running time. The data.json file should be placed in the same directory as the scripts.

The problem was broken down into three subtasks: (i) loading and pre-processing the data, (ii) running a TF-IDF algorithm on the data, (iii) running a non-negative matrix factorization of the TF-IDF matrix.

Step (i)

Data Loading:
For both solutions, data loading is identical. The JSON file is loaded and, for each article, we generate a string containing the title (duplicated) followed by the content of the article. Each string is then appended to an initially empty list. The resulting list contains one document per row (vector).

Preprocessing:

For the solution in nnmf_no_datatreatment.py, no data treatment was performed. The aim was to have a comparison between the pros and cons of the pre-processing options.

For the solution in nnmf_noun_extraction.py, we created a filter that tokenizes the text, performs POS tagging on it, and removes all but the noun elements. The document corresponding to each element of the list can then be regarded as a bag of nouns. Many authors have proposed that nouns are the linguistic units that best represent the topics of a text (for a recent example, Mihalcea 2004, but there are many others). Personal note: I thought that verbs in the gerund (e.g. 'bombing'), which are subject to nominalization rules (e.g. 'the bombing in King's Cross'), have a similar noun-like nature, but unfortunately the results were not so interesting; a sketch of such an extended filter is given below.
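For illustration, here is a minimal sketch of that noun filter, plus a hypothetical extension that also keeps gerunds (VBG) along the lines of the personal note above. It uses nltk.word_tokenize and nltk.pos_tag (which need the relevant NLTK models installed via nltk.download()) rather than the PunktWordTokenizer used in the actual script, so treat it as an illustration rather than the script itself:

import nltk

def keep_tags(tagged, tags=('NN', 'NNP')):
    # keep only the tokens whose POS tag is in the given set
    return [word for word, tag in tagged if tag in tags]

text = "The bombing in King's Cross shocked commuters across London."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

print "Nouns only:         ", keep_tags(tagged)
# hypothetical extension: also keep gerunds (VBG), as discussed in the note above
print "Nouns plus gerunds: ", keep_tags(tagged, tags=('NN', 'NNP', 'VBG'))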

Step (ii) TF*IDF
Roughly, TF-IDF is the product of two components: (i) the term frequency of a term t in a document d – tf(t,d) – and (ii) the inverse document frequency – idf(t,D) – which captures how rare the term is across the corpus D. So although 'the' may appear many times in a document, it also appears in most documents of the corpus, so its TF-IDF value will be low. A term that is frequent in a document but not very frequent in the corpus will have a higher score, which means that it is statistically significant (for discussion, see Salton and Buckley 1988).
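To make the product concrete, here is a small hand computation on a made-up three-document corpus, using the textbook formulation tf-idf(t,d) = tf(t,d) * idf(t,D) with idf(t,D) = log(N / df(t)); note that sklearn uses a smoothed variant, so the exact numbers it produces differ:

from __future__ import division
from math import log

# toy corpus: 3 documents, already tokenized
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "chased", "the", "dog"],
]

def tf(term, doc):
    # raw term frequency of `term` in one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(N / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'the' is frequent in doc 0 but occurs in every document, so idf = log(3/3) = 0
print "tf-idf('the', doc 0):", tfidf("the", docs[0], docs)   # 0.0
# 'mat' occurs once in doc 0 and nowhere else, so it scores higher
print "tf-idf('mat', doc 0):", tfidf("mat", docs[0], docs)   # (1/6) * log(3) ~ 0.18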

Implementation of this step involved little more than calls to the vectorizers in sklearn.feature_extraction.text on our data (CountVectorizer followed by TfidfTransformer in the first script, TfidfVectorizer in the second). Good documentation can be found on the library's website: http://scikit-learn.org/stable/modules/feature_extraction.html .

Note: max_df was set to 0.95, discarding terms that occur in more than 95% of the documents (otherwise the resulting data is a bit noisy, e.g. lots of pronouns), and binary=False (whilst setting it to True is useful for small chunks of text, it didn't work well for our data).
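To illustrate the max_df cut-off, here is a toy example of mine (not the gist's data): a word occurring in every document has a document frequency of 1.0, which is above the 0.95 threshold, so it is dropped from the vocabulary.

from sklearn.feature_extraction import text

toy_corpus = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the bird flew over the house",
]

vectorizer = text.CountVectorizer(max_df=0.95, binary=False)
vectorizer.fit(toy_corpus)
# 'the' appears in 100% of the documents (> 0.95), so it is absent from the vocabulary
print sorted(vectorizer.vocabulary_.keys())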

Step (iii) Non-Negative Matrix Factorization

Non-negative Matrix Factorization (NNMF) can be used as a technique for reducing the complexity of the analysis of a term-document matrix D (such as the TF-IDF matrix above), hence alleviating some problems in information retrieval (see Chang et al. 2002). The original matrix D is approximated by the product of two low-rank matrices: the document-topic matrix W and the topic-word matrix H. The rank of these matrices is given by the number k of topics. W(i,k) represents the importance of topic k to document i, and H(k,j) represents the importance of term j to topic k. My implementation is a re-work of the examples given on the scikit-learn website (http://scikit-learn.org).
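To make W and H tangible, here is a minimal, self-contained sketch of the factorization on a random non-negative matrix (illustrative only; the actual scripts fit the NMF directly on the TF-IDF matrix, with one component per document):

import numpy as np
from sklearn import decomposition

# toy non-negative document-term matrix: 6 documents x 8 terms
D = np.abs(np.random.RandomState(0).randn(6, 8))

k = 3  # number of topics
nmf = decomposition.NMF(n_components=k)
W = nmf.fit_transform(D)   # document-topic matrix, shape (6, k)
H = nmf.components_        # topic-word matrix,    shape (k, 8)

# D is approximated by the product W * H
print "reconstruction error:", np.linalg.norm(D - np.dot(W, H))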

Discussion:

As expected, solution 2, which extracts nouns first, takes much longer (more than 1 minute) than the solution which does not use noun extraction (less than 3 seconds), but it yields much cleaner data.

As a final note, I'd like to point out that the solutions presented here are not exhaustive. Graph-based hybrid solutions like the TextRank algorithm give very good results. One hasn't been presented because of (i) my unfamiliarity with graph-based information retrieval, and (ii) the availability of a very good implementation which would only require me to load the data (you can find it here: https://gist.github.com/1646117 , too good to be true really).

#DISCLAIMER: This is a re-work of the examples given on the scikit-learn
#website (http://scikit-learn.org); most of the work is done by the libraries themselves.
#Thanks to Olivier Grisel <olivier.grisel@ensta.org> for the working example.
import simplejson as json
import nltk
from time import time
from sklearn.feature_extraction import text
from sklearn import decomposition
#setting up initial time for benchmark purposes
initialtime = time()
#loading of initial variables and data
corpus = [] #initializing a corpus variable
data = json.load(open('data.json')) # loads the json data into a variable
#creating the corpus matrix
print "Loading data..."
for article in data["articles"]:
    print "Loading article entitled: " + article["title"]
    document = (article["title"] + " " + article["title"] + " " + article["content"]) # note: titles are added twice, so their words count as more significant when calculating the frequency matrix
    corpus.append(document)
print "The total number of documents in our corpus is: " + str(len(corpus) +1)
components = len(corpus)
print "Calculating term frequency by inverse document frequency matrix..."
vectorizer = text.CountVectorizer(max_df=0.95,max_features=1000,binary=False)
counts = vectorizer.fit_transform(corpus)
tfidf = text.TfidfTransformer().fit_transform(counts)
#vectorizer = text.TfidfVectorizer(max_df=0.95, max_features=750, binary=False) #This excludes the 5% top words
#print vectorizer.vocabulary_ # uncomment to see the vocabulary
#tfidf= vectorizer.fit_transform(corpus) #this calculates the frequency matrix for the words in the corpus
print "Calculating non-negative matrix factorization..."
#n_interestingwords = 10 # sets the number of more interesting words
nmf = decomposition.NMF(n_components=components)
nmf = nmf.fit(tfidf)
features = vectorizer.get_feature_names()
print "These are the documents extracted from the data and their respective topics:"
#nmf.components_
for topicidx, topic in enumerate(nmf.components_):
    print "Keywords : " + " ".join([features[i] for i in topic.argsort()[:-11:-1]]) # this gives the top 10 keywords associated with a given topic
print "Data extraction took: " + str(time() - initialtime) + "s to complete!"
#DISCLAIMER: This is a re-work of the examples given on the scikit-learn
#website (http://scikit-learn.org); most of the work is done by the libraries themselves.
#Thanks to Olivier Grisel <olivier.grisel@ensta.org> for the working example.
import simplejson as json
import nltk
from time import time
from sklearn.feature_extraction import text
from sklearn import decomposition
#setting up initial time for benchmark purposes
initialtime = time()
#Data treatment component:
#General procedures for data treatment
def postagfilter(tagged, tags=['NN','NNP']): # this keeps only pairs whose second element is a noun-related POS tag
    return [item[0] for item in tagged if item[1] in tags]
def normalize(tagged): # this removes punctuation left by the punkt word tokenizer used to clean the initial data
    return [item.replace('.', '') for item in tagged]
#loading of initial variables and data
corpus = [] #initializing a corpus variable
data = json.load(open('data.json')) # loads the json data into a variable
print "Loading data..."
#creating the corpus matrix
for article in data["articles"]:
    document = (article["title"] + " " + article["title"] + " " + article["content"])
    tokensdocument = nltk.PunktWordTokenizer().tokenize(document)
    tagsdocument = nltk.pos_tag(tokensdocument)
    filtereddocument = postagfilter(tagsdocument)
    bagofnouns = ' '.join(normalize(filtereddocument))
    corpus.append(bagofnouns)
components = len(corpus)
print "Calculating term frequency by inverse document frequency matrix..."
vectorizer = text.TfidfVectorizer(max_df=0.95, max_features=500, binary=False) #note: binary has to be set to False, otherwise this leads to a feature explosion when calculating the NNMF
#print vectorizer.vocabulary_ # uncomment this, if you want to see the vocabulary
tfidf= vectorizer.fit_transform(corpus) #this calculates the frequency matrix for the words in the corpus
print "Calculating non-negative matrix factorization..."
nmf = decomposition.NMF(n_components=components)
nmf = nmf.fit(tfidf)
features = vectorizer.get_feature_names() # potential problem here
print "This are the documents extracted from the data and their respective topics:"
for topicidx, topic in enumerate(nmf.components_):
    print "Keywords : " + " ".join([features[i] for i in topic.argsort()[:-11:-1]]) # this gives the top 10 keywords associated with a given topic
print "Data extraction took: " + str(time() - initialtime) + "s to complete!"
@husseinkohy

Hi,
I believe norm=None also needs to be passed to text.TfidfVectorizer(), otherwise some topics may end up having the same set of words. Normalization needs to occur after nmf.fit, according to this article: http://web.stanford.edu/class/ee378b/papers/xu-lin-gong-nmf.pdf.

from sklearn import preprocessing
W = preprocessing.normalize(nmf.fit_transform(tfidf_matrix), norm='l2', copy=False)
H = preprocessing.normalize(nmf.components_, norm='l2', copy=False)
