<!DOCTYPE html>
<html>
<body>
<h1>Extracting Skills from Personal Communication Data using StackExchange Dataset</h1>
<p>In this blog post, we will see how to use the publicly available Stack Exchange data dump to extract skills from communication data.
First, download the Stack Exchange dataset.
The entire dump can be downloaded <a href="https://archive.org/details/stackexchange">here</a>. It covers many Stack Exchange websites such as Stack Overflow, cs, datascience, physics, history and so on. You can download only the compressed files you need, or fetch the entire dump via torrent. Since we were working on Linux on an OpenStack framework, we had to download the torrent files from the terminal; more information about downloading torrents from the command line is available <a href="https://www.learn2crack.com/2013/10/download-torrent-using-terminal.html">here</a>. After downloading, extract the 7z files (this can be done in one script, sketched below). Each 7z file corresponds to one Stack Exchange website. Since we were interested only in technical websites, we deleted the 7z files for sites like japanese.stackexchange, spanish.stackexchange and so on. We now have our Stack Exchange knowledge base.</p>
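<p>A minimal sketch of that extraction script, assuming the 7z command-line tool (p7zip) is installed and that the archives sit in a hypothetical "stackexchange_dump" directory; it unpacks each archive into its own folder:</p>
<pre>
import os
import subprocess

DUMP_DIR = "stackexchange_dump"  # hypothetical download location

for name in os.listdir(DUMP_DIR):
    if not name.endswith(".7z"):
        continue
    archive = os.path.join(DUMP_DIR, name)
    out_dir = os.path.join(DUMP_DIR, name[:-3])  # one folder per site
    os.makedirs(out_dir, exist_ok=True)
    # 7z's -o flag takes the output directory with no space after it
    subprocess.run(["7z", "x", archive, "-o" + out_dir], check=True)
</pre>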
<p>Next, process and index the posts:
We will only use the Posts.xml file in each folder. A post on Stack Exchange is either a question or an answer, and each question is associated with a set of tags. We treat these tags as skills, and we use the Stack Exchange knowledge base as a training set to predict them. For this project, we implemented a k-NN multi-label classification model using Lucene, a text search engine library written in Java. We will actually use PyLucene, a Python wrapper around Lucene; a very nice explanation of setting up PyLucene is given <a href="http://bendemott.blogspot.fi/2013/11/installing-pylucene-4-451.html">here</a>. To build this search engine, we first index all the posts with two fields, 'text' and 'tags'. Some pre-processing is applied to the 'text' before indexing. This is done over all folders, producing a single combined index.</p>
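<p>The sketch below illustrates this indexing step. It assumes PyLucene 4.x (matching the setup guide linked above); the field names 'text' and 'tags' follow the description, while the index path and the decision to skip untagged posts are illustrative, and the pre-processing of 'text' is elided:</p>
<pre>
import lucene
import xml.etree.ElementTree as ET
from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

lucene.initVM()
analyzer = StandardAnalyzer(Version.LUCENE_44)
config = IndexWriterConfig(Version.LUCENE_44, analyzer)
writer = IndexWriter(SimpleFSDirectory(File("se_index")), config)

def index_posts(posts_xml):
    # Posts.xml is a flat list of row elements; questions carry a Tags attribute
    for _, row in ET.iterparse(posts_xml):
        if row.tag == "row":
            body = row.get("Body", "")
            tags = row.get("Tags", "")
            if body and tags:  # only tagged posts (questions) are useful here
                doc = Document()
                doc.add(TextField("text", body, Field.Store.NO))
                doc.add(TextField("tags", tags, Field.Store.YES))
                writer.addDocument(doc)
            row.clear()  # free memory while streaming through the file

index_posts("stackoverflow.com/Posts.xml")  # repeat for every site folder
writer.close()
</pre>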
<p>
Given a set of messages (from an instant messaging platform) belonging to an individual, the task is to predict the tags for each message. Each message is used as a query against the search engine, searching on the 'text' field. Every tag has a score, initialized to zero. For each message we retrieve the top k most similar posts, together with their similarity values and tags, and add each post's similarity value to the score of each of its tags. After all messages are processed, the tags with the largest accumulated scores are declared the skills of the individual. </p>
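<p>A minimal sketch of this scoring loop, again assuming PyLucene 4.x, the 'se_index' built above, and an already-initialized JVM; the value of k, the sample message, and the tag-parsing regex are illustrative:</p>
<pre>
import re
from collections import defaultdict
from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

analyzer = StandardAnalyzer(Version.LUCENE_44)
searcher = IndexSearcher(DirectoryReader.open(SimpleFSDirectory(File("se_index"))))
parser = QueryParser(Version.LUCENE_44, "text", analyzer)

def score_skills(messages, k=10):
    scores = defaultdict(float)  # every tag's score starts at zero
    for message in messages:
        query = parser.parse(QueryParser.escape(message))
        for hit in searcher.search(query, k).scoreDocs:
            tags = searcher.doc(hit.doc).get("tags")
            # tags are stored wrapped in angle brackets; pull out the names
            for tag in re.findall(r"[\w#+.-]+", tags):
                scores[tag] += hit.score  # add similarity to the tag's score
    # tags with the largest accumulated scores are the predicted skills
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(score_skills(["how do I merge two branches in git"])[:5])
</pre>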
</body>
</html>