
@tgalery
Created August 15, 2012 02:04
Non-Negative Matrix Factorisation solutions to topic extraction in python

These are two solutions to a topic extraction task. The sample data is loaded into a variable by each script. I've included running times for both solutions, so we have precise information about the cost of each one in addition to its results. According to Pazienza et al. (2005), two trends in textual information extraction can be identified: one based on linguistic and syntactic information, the other based on statistical analysis of frequency patterns (which usually treats a text as a bag of words). Whilst the first solution presented here is a purely statistical one, the second incorporates information about syntactic categories into the analysis (hence a hybrid approach).

After presenting the solutions and briefly mentioning an alternative to them, I'll move on to a short theoretical discussion.

1 – Set-up used:

*Ubuntu 11.04 Natty AMD64

*Python 2.7.3

*python re library

*python nltk 2.0 library and the required NumPy and PyYaml (For NLP tasks)

*python tweeterstream 1.1.1 library (For Twitter manipulation)

*python simplejson library (For JSON manipulation)

*python sklearn 0.11 library (For machine learning tasks)

*python time module for benchmarking different solutions

Installation Instructions:

  • Python and python installation packages: from command prompt run:

  • sudo apt-get install python python-pip python-setuptools

  • NLTK, NumPy, PyYaml library: from command prompt run
    sudo pip install -U numpy
    sudo pip install -U pyyaml nltk
    Test installation: run python then type import nltk

  • Sklearn 0.11: from command prompt run:

    For the dependencies: sudo apt-get install python-dev python-numpy python-numpy-dev python-setuptools python-scipy libatlas-dev g++ python-matplotlib python-pip

    sudo pip install -U scikit-learn
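    Test installation: from the command prompt, the following one-liner simply prints the installed scikit-learn version (assuming the system Python 2.7 listed above):

    python -c "import sklearn; print sklearn.__version__"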

Theoretical assumptions:

The data manipulated is a JSON file containing articles with a variety of fields. Since the solutions implemented here are based on machine learning techniques (involving the frequency distribution of tokens in a bag of words), I only used the title and the content of each article. The title was duplicated, so that the frequency of the words contained in it would be increased (this is a simple way of capturing the intuition that words in the title are a bit more important). A sketch of the expected file structure is given below.
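For reference, the scripts only assume a top-level "articles" list whose items have "title" and "content" fields (any other fields are ignored). Here is a hypothetical snippet that writes a toy data.json with that structure, using placeholder strings rather than real data, which can be used to smoke-test the scripts:

import simplejson as json

# Hypothetical toy file: the field names ("articles", "title", "content") match
# what the scripts read; the values are placeholders, not real articles.
sample = {
    "articles": [
        {"title": "First placeholder title",
         "content": "Placeholder body text for the first article."},
        {"title": "Second placeholder title",
         "content": "Placeholder body text for the second article."}
    ]
}
with open('data.json', 'w') as f:
    json.dump(sample, f)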

Running instructions:

Simply run either nnmf_no_datatreatment.py or nnmf_noun_extraction.py; each script displays its results and running time. The data.json file should be placed in the same directory as the scripts.

The problem was broken down into three subtasks: (i) loading and pre-processing the data, (ii) running a TF-IDF algorithm on the data, (iii) running a non-negative matrix factorization of the TF-IDF matrix.

Step (i)

Data Loading:
For both solutions, data loading is identical. The JSON file is loaded and, for each article, we generate a string containing the title (duplicated) followed by the content of the article. Each string is then appended to an initially empty list. The resulting list contains one document per row (vector).

Preprocessing:

For the solution in nnmf_no_datatreatment.py, no data treatment was performed. The aim was to have a comparison between the pros and cons of the pre-processing options.

For the solution in nnmf_noun_extraction.py, we created a filter that tokenizes the text, performs POS tagging on it, and removes all but the noun elements. The document corresponding to each element of the list can then be regarded as a bag of nouns. Many authors have proposed that nouns are the linguistic units that best represent the topics of a text (for a recent example, Mihalcea 2004, but there are many others). Personal note: I thought that verbs in the gerund (e.g. 'bombing'), which are subject to nominalization rules (e.g. 'the bombing in King's Cross'), have a similar noun-like nature, but unfortunately the results were not so interesting; a sketch of such an extended filter is given below.
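For illustration, here is a minimal sketch of that noun filter, plus a hypothetical extension that also keeps gerunds (VBG) along the lines of the personal note above. It uses nltk.word_tokenize and nltk.pos_tag (which need the relevant NLTK models installed via nltk.download()) rather than the PunktWordTokenizer used in the actual script, so treat it as an illustration rather than the script itself:

import nltk

def keep_tags(tagged, tags=('NN', 'NNP')):
    # keep only the tokens whose POS tag is in the given set
    return [word for word, tag in tagged if tag in tags]

text = "The bombing in King's Cross shocked commuters across London."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

print "Nouns only:         ", keep_tags(tagged)
# hypothetical extension: also keep gerunds (VBG), as discussed in the note above
print "Nouns plus gerunds: ", keep_tags(tagged, tags=('NN', 'NNP', 'VBG'))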

Step (ii) TF*IDF
Roughly, TF-IDF is the product of two components: (i) the term frequency of a term t in a document d – tf(t,d) – and (ii) the inverse document frequency – idf(t,D) – which captures how rare the term is across the corpus D. So although 'the' may appear many times in a document, it also appears in most documents of the corpus, so its TF-IDF value will be low. A term that is frequent in a document but not very frequent in the corpus will have a higher score, which means that it is statistically significant (for discussion, see Salton and Buckley 1988).
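To make the product concrete, here is a small hand computation on a made-up three-document corpus, using the textbook formulation tf-idf(t,d) = tf(t,d) * idf(t,D) with idf(t,D) = log(N / df(t)); note that sklearn uses a smoothed variant, so the exact numbers it produces differ:

from __future__ import division
from math import log

# toy corpus: 3 documents, already tokenized
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "chased", "the", "dog"],
]

def tf(term, doc):
    # raw term frequency of `term` in one document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(N / number of documents containing the term)
    df = sum(1 for d in docs if term in d)
    return log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'the' is frequent in doc 0 but occurs in every document, so idf = log(3/3) = 0
print "tf-idf('the', doc 0):", tfidf("the", docs[0], docs)   # 0.0
# 'mat' occurs once in doc 0 and nowhere else, so it scores higher
print "tf-idf('mat', doc 0):", tfidf("mat", docs[0], docs)   # (1/6) * log(3) ~ 0.18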

Implementation of this step involved little more than calls to the vectorizers in sklearn.feature_extraction.text on our data (CountVectorizer followed by TfidfTransformer in the first script, TfidfVectorizer in the second). Good documentation can be found on the library's website: http://scikit-learn.org/stable/modules/feature_extraction.html .

Note: max_df was set to 0.95, discarding terms that occur in more than 95% of the documents (otherwise the resulting data is a bit noisy, e.g. lots of pronouns), and binary=False (whilst setting it to True is useful for small chunks of text, it didn't work well for our data).
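To illustrate the max_df cut-off, here is a toy example of mine (not the gist's data): a word occurring in every document has a document frequency of 1.0, which is above the 0.95 threshold, so it is dropped from the vocabulary.

from sklearn.feature_extraction import text

toy_corpus = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "the bird flew over the house",
]

vectorizer = text.CountVectorizer(max_df=0.95, binary=False)
vectorizer.fit(toy_corpus)
# 'the' appears in 100% of the documents (> 0.95), so it is absent from the vocabulary
print sorted(vectorizer.vocabulary_.keys())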

Step (iii) Non-Negative Matrix Factorization

Non-negative Matrix Factorization (NNMF) can be used as a technique for reducing the complexity of the analysis of a term-document matrix D (such as the TF-IDF matrix above), hence alleviating some problems in information retrieval (see Chang et al. 2002). The original matrix D is approximated by the product of two low-rank matrices: the document-topic matrix W and the topic-word matrix H. The rank of these matrices is given by the number k of topics. W(i,k) represents the importance of topic k to document i, and H(k,j) represents the importance of term j to topic k. My implementation is a re-work of the examples given on the scikit-learn website (http://scikit-learn.org).
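To make W and H tangible, here is a minimal, self-contained sketch of the factorization on a random non-negative matrix (illustrative only; the actual scripts fit the NMF directly on the TF-IDF matrix, with one component per document):

import numpy as np
from sklearn import decomposition

# toy non-negative document-term matrix: 6 documents x 8 terms
D = np.abs(np.random.RandomState(0).randn(6, 8))

k = 3  # number of topics
nmf = decomposition.NMF(n_components=k)
W = nmf.fit_transform(D)   # document-topic matrix, shape (6, k)
H = nmf.components_        # topic-word matrix,    shape (k, 8)

# D is approximated by the product W * H
print "reconstruction error:", np.linalg.norm(D - np.dot(W, H))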

Discussion:

As expected, solution 2, which extracts nouns first, takes much longer (more than 1 minute) than the solution which does not use noun extraction (less than 3 seconds), but it yields much cleaner data.

As a final note, I'd like to point out that the solutions presented here are not exhaustive. Graph-based hybrid solutions like the TextRank algorithm give very good results. One hasn't been presented because of (i) my unfamiliarity with graph-based information retrieval, and (ii) the availability of a very good implementation which would only require me to load the data (you can find it here: https://gist.github.com/1646117 , too good to be true really).

#DISCLAIMER: This is a re-work of the examples given on the scikit-learn
#website (http://scikit-learn.org); most of the work is done by the libraries themselves.
#Thanks to Olivier Grisel <olivier.grisel@ensta.org> for the working example.
import simplejson as json
import nltk
from time import time
from sklearn.feature_extraction import text
from sklearn import decomposition
#setting up initial time for benchmark purposes
initialtime = time()
#loading of initial variables and data
corpus = [] #initializing a corpus variable
data = json.load(open('data.json')) # loads the json data into a variable
#creating the corpus matrix
print "Loading data..."
for article in data["articles"]:
    print "Loading article entitled: " + article["title"]
    document = (article["title"] + " " + article["title"] + " " + article["content"]) # note: titles are added twice, so their words count as more significant when calculating the frequency matrix
    corpus.append(document)
print "The total number of documents in our corpus is: " + str(len(corpus) +1)
components = len(corpus)
print "Calculating term frequency by inverse document frequency matrix..."
vectorizer = text.CountVectorizer(max_df=0.95,max_features=1000,binary=False)
counts = vectorizer.fit_transform(corpus)
tfidf = text.TfidfTransformer().fit_transform(counts)
#vectorizer = text.TfidfVectorizer(max_df=0.95, max_features=750, binary=False) #This excludes the 5% top words
#print vectorizer.vocabulary_ # uncomment to see the vocabulary
#tfidf= vectorizer.fit_transform(corpus) #this calculates the frequency matrix for the words in the corpus
print "Calculating non-negative matrix factorization..."
#n_interestingwords = 10 # sets the number of more interesting words
nmf = decomposition.NMF(n_components=components)
nmf = nmf.fit(tfidf)
features = vectorizer.get_feature_names()
print "These are the documents extracted from the data and their respective topics:"
#nmf.components_
for topicidx, topic in enumerate(nmf.components_):
    print "Keywords : " + " ".join([features[i] for i in topic.argsort()[:-11:-1]]) # this gives the top 10 keywords associated with a given topic
print "Data extraction took: " + str(time() - initialtime) + "s to complete!"
#DISCLAIMER: This is a re-work of the examples given on the scikit-learn
#website (http://scikit-learn.org); most of the work is done by the libraries themselves.
#Thanks to Olivier Grisel <olivier.grisel@ensta.org> for the working example.
import simplejson as json
import nltk
from time import time
from sklearn.feature_extraction import text
from sklearn import decomposition
#setting up initial time for benchmark purposes
initialtime = time()
#Data treatment component:
#General procedures for data treatment
def postagfilter(tagged, tags=['NN','NNP']): # this keeps only pairs whose second element is a noun-related POS tag
    return [item[0] for item in tagged if item[1] in tags]
def normalize(tagged): # this removes punctuation left by the punkt word tokenizer used to clean the initial data
    return [item.replace('.', '') for item in tagged]
#loading of initial variables and data
corpus = [] #initializing a corpus variable
data = json.load(open('data.json')) # loads the json data into a variable
print "Loading data..."
#creating the corpus matrix
for article in data["articles"]:
    document = (article["title"] + " " + article["title"] + " " + article["content"])
    tokensdocument = nltk.PunktWordTokenizer().tokenize(document)
    tagsdocument = nltk.pos_tag(tokensdocument)
    filtereddocument = postagfilter(tagsdocument)
    bagofnouns = ' '.join(normalize(filtereddocument))
    corpus.append(bagofnouns)
components = len(corpus)
print "Calculating term frequency by inverse document frequency matrix..."
vectorizer = text.TfidfVectorizer(max_df=0.95, max_features=500, binary=False) #note: binary has to be set to False, otherwise this leads to a feature explosion when calculating the NNMF
#print vectorizer.vocabulary_ # uncomment this, if you want to see the vocabulary
tfidf= vectorizer.fit_transform(corpus) #this calculates the frequency matrix for the words in the corpus
print "Calculating non-negative matrix factorization..."
nmf = decomposition.NMF(n_components=components)
nmf = nmf.fit(tfidf)
features = vectorizer.get_feature_names() # potential problem here
print "This are the documents extracted from the data and their respective topics:"
for topicidx, topic in enumerate(nmf.components_):
    print "Keywords : " + " ".join([features[i] for i in topic.argsort()[:-11:-1]]) # this gives the top 10 keywords associated with a given topic
print "Data extraction took: " + str(time() - initialtime) + "s to complete!"
@husseinkohy

Hi,
I believe norm=None also needs to be passed to text.TfidfVectorizer(), otherwise some topics may end up having the same set of words. Normalization needs to occur after nmf.fit, according to this article: http://web.stanford.edu/class/ee378b/papers/xu-lin-gong-nmf.pdf.

from sklearn import preprocessing
W = preprocessing.normalize(nmf.fit_transform(tfidf_matrix), norm='l2', copy=False)
H = preprocessing.normalize(nmf.components_, norm='l2', copy=False)
