Darshan M.S. MSDarshan91

  • Wunderflats
  • Düsseldorf, Germany
MSDarshan91 / stopwords-kn.txt
Last active January 12, 2021 07:38
Stopwords for Kannada
ಮತ್ತು
ಒಂದು
ರಲ್ಲಿ
ಹಾಗೂ
ಎಂದು
ಅಥವಾ
ಇದು
ಅವರು
<!DOCTYPE html>
<html>
<body>
<h1>Extracting Skills from Personal Communication Data using StackExchange Dataset</h1>
<p>In this blog post, we will see how to use the publicly available Stack Exchange data dump to extract skills from communication data.
First, download the Stack Exchange dataset, which is available <a href="https://archive.org/details/stackexchange">here</a>.
There are many Stack Exchange sites, such as stackoverflow, cs, datascience, physics, history and so on. You can download only the compressed files you need, or fetch the entire dump using torrents. Since we were using Linux on an OpenStack framework, we had to download the torrent files from the terminal; more information about downloading torrents from the command line is <a href="https://www.learn2crack.com/2013/10/download-torrent-using-terminal.html">here</a>. After downloading the files, extract the 7z archives (this can be done in one script). Each 7z file corresponds to a stackexchange
import csv
import random
import math
import operator
def loadDataset(filename, split, trainingSet=[], testSet=[]):
    # On Python 3 the csv module expects a text-mode file opened with newline=''.
    with open(filename, 'r', newline='') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):