A description of my work in the Google Summer of Code Project with The Berkman Klein Center for Internet & Society at Harvard University

I contributed to Media Cloud this summer as part of Google Summer of Code 2017.

Overview

Project Background: Host Organisation

The Berkman Klein Center for Internet & Society at Harvard University is dedicated to exploring, understanding, and shaping the development of the digitally-networked environment. A diverse, interdisciplinary community of scholars, practitioners, technologists, policy experts, and advocates, we seek to tackle the most important challenges of the digital age while keeping a focus on tangible real-world impact in the public interest. Our faculty, fellows, staff and affiliates conduct research, build tools and platforms, educate others, form bridges and facilitate dialogue across and among diverse communities.

Project Description

In this project, I developed a proof-of-concept machine learning tool for topic modelling. Specifically, it uses unsupervised machine learning algorithms to generate a topic for each news article and to group articles that share the same topic. In this way, users can not only find articles on a chosen topic but also gain a brief understanding of each article by reading its topic.

Project Proposal

My project proposal can be found here.

Implementations

As discussed in the proposal, the project has three stages: tokenizing articles (proposal Phase I), implementing algorithms (Phases II and III), and tuning parameters (Phase IV). All my work on this project is covered by a single Pull Request, which contains all the commits, the PR reviews from my mentor, and my responses.

Tokenization

In addition to breaking sentences into tokens (i.e. words), I used lemmatization to reduce the various forms of a word to a single base form, for example mapping 'studying' and 'studied' to 'study', or 'datum' and 'data' to 'datum'. Furthermore, I applied stop-word removal based on an existing stop-word list. These two procedures help the algorithms in the next phase gain a better understanding of the articles, and contribute to higher accuracy and better performance.
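To make this step concrete, here is a minimal sketch assuming spaCy for lemmatization and its built-in English stop-word list; the project itself may rely on different libraries and word lists, and the `tokenize` helper is only illustrative.

```python
# A minimal, illustrative sketch of the tokenization step (not the project's
# actual code). Assumes spaCy and its small English model:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')

def tokenize(text):
    """Break an article into lemmas, dropping stop words and punctuation."""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc
            if token.is_alpha and not token.is_stop]

print(tokenize("She studied the data while studying other studies."))
# Verb forms such as 'studying'/'studied' are reduced to 'study', so the
# topic modelling step sees one term per word family.
```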

Algorithms

To find the best and most suitable way to achieve this goal, I completed three implementations of two different algorithms. The first two use the Latent Dirichlet Allocation model (LDA) in two different ways, while the third applies the Non-negative Matrix Factorization model (NMF). You may notice this differs slightly from my project proposal, where I planned to use a Support Vector Machine (SVM). While researching for this project, I discovered that LDA and NMF are more suitable for this task in theory and exhibit much better performance in practice. With my supervisor's permission, I switched to these two algorithms instead of SVM at the beginning of the project.

I have implemented both algorithms in a way that allows users to freely adjust the total number of topics and the number of words per topic. Following the idea in the proposal, the first word of each topic is considered the main topic, while the rest are treated as subtopics around that main topic, since they often occur together.
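As a rough illustration of these adjustable parameters (not the project's actual code), the sketch below shows how both models can expose the same two knobs, here named `total_topics` and `words_per_topic`, using scikit-learn; the real implementation may use gensim or different preprocessing.

```python
# Illustrative sketch: LDA and NMF with a shared interface for the number of
# topics and the number of words per topic. Uses scikit-learn.
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def topics_lda(docs, total_topics=5, words_per_topic=4):
    """Fit LDA on raw token counts and return the top words of each topic."""
    vectorizer = CountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform(docs)
    model = LatentDirichletAllocation(n_components=total_topics, random_state=0)
    model.fit(counts)
    return _top_words(model, vectorizer, words_per_topic)

def topics_nmf(docs, total_topics=5, words_per_topic=4):
    """Fit NMF on TF-IDF weights and return the top words of each topic."""
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(docs)
    model = NMF(n_components=total_topics, random_state=0)
    model.fit(tfidf)
    return _top_words(model, vectorizer, words_per_topic)

def _top_words(model, vectorizer, words_per_topic):
    """The first word of each topic is treated as the main topic,
    the remaining words as its subtopics."""
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in component.argsort()[::-1][:words_per_topic]]
            for component in model.components_]
```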

Tuning

The most important task is to identify the optimal total number of topics across all articles, i.e. the one that achieves the highest likelihood. While the brute-force approach (trying out every possible number of topics) is straightforward and guarantees a correct result, it is extremely time-consuming and hence unrealistic for a large data set such as the one in this project. After careful analysis of the data, I built a polynomial model to simplify the calculation. Formally, it is defined as:

$$Likelihood = a * topic_num^2 + b * topic_num + c$$,

where $topic_num$ denotes the total number of topics in all articles and $a$ is negative, so the curve opens downward and has a single maximum.

We can utilize this model as follows:

  1. Generate some sample points to estimate the parameters (i.e. $a$, $b$, $c$). Although in general the more points we compute, the more accurate the model becomes, I implemented a recursive method that always picks the most informative next point and stops as soon as the parameters stabilize. This saves a significant amount of time and computation.
  2. Locate the highest point of the polynomial (i.e. the maximum $Likelihood$ value) by setting the derivative to zero.
  3. Return the corresponding $topic_num$ value. I also made the model adaptive to higher orders (i.e. with $topic_num^3$ or above); a sketch of the basic procedure follows.
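The sketch below illustrates this procedure under simplifying assumptions: the recursive sampling is reduced to a fixed set of sample points, and `likelihood_for` is a hypothetical callable that trains a model with the given number of topics and returns its likelihood.

```python
# Simplified sketch of the quadratic tuning idea (not the project's code).
import numpy as np

def best_topic_num(sample_topic_nums, likelihood_for):
    """Fit Likelihood = a*n^2 + b*n + c to sampled points and return the
    topic count at the parabola's maximum (a should come out negative)."""
    xs = np.array(sample_topic_nums, dtype=float)
    ys = np.array([likelihood_for(int(n)) for n in xs])
    a, b, c = np.polyfit(xs, ys, deg=2)     # estimate the parameters
    assert a < 0, "expected a downward-opening parabola"
    # Setting the derivative 2*a*n + b to zero gives the maximum.
    return int(round(-b / (2 * a)))

# Usage: pass a callable that trains a model with n topics and returns its
# (log-)likelihood, e.g. best_topic_num([5, 20, 50, 100], likelihood_for).
```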

Future Work

Although all tasks planned for this project have been completed and the current program goes well beyond the proof of concept expected in the project specification, there are always things that can be improved.

Travis

One problem at this stage is that we cannot test the code with a larger data set on Travis, due to Travis's restrictions on build time and log size. Currently, my program passes the following tests:

  1. With a data set of size 1 on my local machine;
  2. With a larger data set on my local machine;
  3. With a data set of size 1 on Travis.

In the future, with fewer restrictions from Travis, we would like to test the program on Travis with a larger data set, or even the whole set. To assist with this, I have:

  1. Prepared several sample data files in the same format as the data in the DB;
  2. Separated the sample data handler from the rest of the code and gave it the same API as the DB handler, so either one can be used without noticing the difference. As a result, changing only the sample file name in sample_handler.py is enough to test the program on those data sets as well (a sketch of this pattern follows).
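The following hypothetical sketch illustrates the interchangeable-handler idea; the class and method names are placeholders and do not reflect the actual API of sample_handler.py or the DB handler.

```python
# Hypothetical sketch: two handlers with the same interface, so the rest of
# the program does not care which one it is given.
import json

class SampleHandler:
    def __init__(self, sample_file):
        self.sample_file = sample_file

    def fetch_stories(self):
        """Load stories from a sample file kept in the same format as the DB."""
        with open(self.sample_file) as f:
            return json.load(f)

class DBHandler:
    def __init__(self, connection):
        self.connection = connection

    def fetch_stories(self):
        """Load stories from the database (query details omitted)."""
        ...

def run_topic_model(handler):
    stories = handler.fetch_stories()   # works with either handler
    ...
```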

Alpha and other parameters

Another thing we can do in the future is to tune alpha, another hyperparameter of the LDA model. This should not be very hard with the code I have already written to tune topic_num.

Also, before using this program on a massive data set (e.g. the whole set), we could potentially modify the imported packages to reduce the running time further. For example, one thing I have always wanted to do is to adjust the number of training iterations dynamically (i.e. train the model for 1,000 iterations and check whether it is stable before continuing with more iterations). I had a snippet that did this, but I had to remove it because the packages do not provide such an API.

Demo and Blogs

I have also created a Prezi as a quick demonstration of my work, and posted two blog posts (1, 2) about this fantastic Google Summer of Code experience on the Harvard University Geeks Blog.

Acknowledgement

Finally, I would like to thank my mentor Linas Valiukas (lvaliukas@cyber.law.harvard.edu) for his consistent and helpful guidance. This project would not have been such a success without the efficient communication between us.

I would also like to thank the Berkman Klein Center for Internet & Society at Harvard University for hosting this project and for providing other participants and me with a channel to discuss our projects and learn from each other.

Last but not least, I am truly grateful to Google for giving me this opportunity to work with these fantastic, talented, kind-hearted people and for such a valuable, wonderful summer.
