Created
July 24, 2012 14:58
TF-IDF
2012-04-02T06:52:32Z||oprator berpengalaman telkomsel selain excel indosat saya mmbutuhkan operator marketing yang ckp handal berpengalaman
2012-04-02T07:12:42Z||rt pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan
2012-04-02T07:00:12Z||pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g dijamin internetan wusshhhhh
2012-04-02T07:03:31Z||edit jalur akses internet indosat gunakan proxy ip add 195 189 142 132 port ip 80 yang lain biarkan seperti aslinya
2012-04-02T06:56:49Z||haha <makian> oprator berpengalaman telkomsel selain indosat bth operator marketing yang pengalaman
2012-04-02T07:22:10Z||rt pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g dijamin internetan wusshhhhh
2012-04-02T07:32:05Z||di atmajaya abisnya mirip sangat aula indosat gambarnya tadi haha
2012-04-02T07:28:44Z||saya cinta karo indosat mergo terpaksa
2012-04-02T09:43:10Z||pan sarua indosat mnh haha dibawain ngan peje hela ngke hayu wk
2012-04-02T11:24:16Z||euw indosat should fix their bad connection
2012-04-02T12:57:58Z||gadeliv deliv acan eleuh eleuh indosat tahun meni geleuh
2012-04-02T12:54:53Z||rt indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T12:52:18Z||indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T15:07:35Z||2rts ktupat aya bsi ek aya 550 e63 hde kneh brow msi hp indosat wii hyong symbian pguh hp masuk kneh n0
2012-04-02T15:16:57Z||senyumlicik penghianat yaps beralih ke indosat maybe it better than
2012-04-02T15:51:08Z||giliran lancar teman teman saya pada tidur terimakasih indosat
2012-04-02T16:18:37Z||disappointing with indosat internet connection slow it has been like this week
2012-04-02T08:01:02Z||pc laptop handphone barang impor operator selular indosat xl telkomsel milik asing qatar singapur malaysia
2012-04-02T12:58:14Z||pakai sarung tangan ngerakit kabel22 pasang petasan otw gedung indosat kedipin mata 2kali bom duarr tetap gdlv3
2012-04-02T08:11:36Z||rt pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g
2012-04-02T12:54:27Z||they selling unlimited that really limit our call hello where have ylki they still alive indosat good cute
2012-04-02T13:04:54Z||indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T14:41:12Z||reservation southeast asia official phone number 62 856 2121 666 indosat official blackberry
2012-04-02T12:49:19Z||indosat good cute ads reasonable price but can even pakai single call so hope they have fire insurance
2012-04-02T15:53:15Z||sama2 giliran lancar teman teman saya pada tidur terimakasih indosat
2012-04-02T13:00:46Z||ah jan sikak oq gra2 lali pngaturane njuk seg indosat ra keno gawe bka fb
2012-04-02T13:27:06Z||perbulannya berapa ini min pakai indosat internet broom bisa internetan cepat tanpa putus didukung dengan jaringan 5g
2012-04-02T14:09:59Z||guess indosat android worst combination former were slow til now latter eating enormous bytes
2012-04-02T15:15:07Z||penghianat yaps beralih ke indosat maybe it better than
2012-04-02T15:22:11Z||adele old friends why so shy me indosat why so bad
2012-04-02T11:25:13Z||walaupun hujan deras gini sinyal 3g indosat dirumah saya tetap kuat
2012-04-02T12:54:57Z||iklan indosat eneg tiru2 genkisudo huek
2012-04-02T14:11:55Z||tetap saja indosat abaaaaal haha
2012-04-02T13:53:53Z||rt apa definisi sukses menurut teman teman pakai indosat mobile
2012-04-02T14:09:17Z||haha tidak-ada kerjaan waktu ngerjain operator indosat
2012-04-02T17:22:28Z||dang saat mau buka koran ternyata ada iklan indosat haha suka shock begitu saya
2012-04-02T16:44:54Z||euweuh ka urg nte geus diaktifkeun can rhie zoel zul aya telepon ti indosat jang ngaaktfkeun kartu prabayar tea
2012-04-02T16:36:48Z||people who complain about indosat services like who complain about getting aids whore they knew had aids
2012-04-02T10:30:52Z||asik puas internetan pakai indosat internet broom gas pool ngebuut
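Each input line above is a timestamp, a `||` separator, and the preprocessed tweet text. As a rough sketch of how one such line could be split into terms (the class and method names below are hypothetical and not part of the gist; the gist's own program cuts a fixed-length prefix instead of searching for the separator):

```java
import java.util.ArrayList;
import java.util.List;

class TweetLineParser {
    /** Returns the text after the first "||", or the whole line if the separator is missing. */
    static String textOf(String line) {
        int sep = line.indexOf("||");
        return (sep >= 0) ? line.substring(sep + 2).trim() : line.trim();
    }

    /** Splits the tweet text on whitespace and drops the keyword that occurs in every tweet. */
    static List<String> tokens(String line, String skipKeyword) {
        List<String> out = new ArrayList<>();
        for (String tok : textOf(line).split("\\s+")) {
            if (!tok.isEmpty() && !tok.equalsIgnoreCase(skipKeyword)) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "2012-04-02T07:28:44Z||saya cinta karo indosat mergo terpaksa";
        // prints [saya, cinta, karo, mergo, terpaksa]
        System.out.println(tokens(line, "indosat"));
    }
}
```

Searching for the separator rather than relying on a fixed offset is a design choice of this sketch; it tolerates timestamps of a different length at the cost of one extra scan per line.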
oprator=2.9444389791664403; yang=2.5649493574615367; saya=1.791759469228055; marketing=2.9444389791664403; selain=2.9444389791664403; telkomsel=2.5649493574615367; berpengalaman=5.8888779583328805; operator=2.1972245773362196; cepat=1.9459101490553132; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; rt=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internetan=1.791759469228055; internet=1.3862943611198906; tanpa=1.9459101490553132; putus=1.9459101490553132; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; wusshhhhh=2.9444389791664403; internetan=3.58351893845611; cepat=1.9459101490553132; 5g=2.1972245773362196; dijamin=2.9444389791664403; bisa=1.9459101490553132; pakai=1.0986122886681098; internet=1.3862943611198906; jaringan=2.1972245773362196; tanpa=1.9459101490553132; putus=1.9459101490553132; yang=2.5649493574615367; internet=1.3862943611198906; oprator=2.9444389791664403; yang=2.5649493574615367; marketing=2.9444389791664403; selain=2.9444389791664403; haha=1.791759469228055; telkomsel=2.5649493574615367; berpengalaman=2.9444389791664403; operator=2.1972245773362196; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; wusshhhhh=2.9444389791664403; internetan=3.58351893845611; cepat=1.9459101490553132; 5g=2.1972245773362196; dijamin=2.9444389791664403; rt=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internet=1.3862943611198906; jaringan=2.1972245773362196; putus=1.9459101490553132; tanpa=1.9459101490553132; haha=1.791759469228055; saya=1.791759469228055; haha=1.791759469228055; connection=2.9444389791664403; bad=2.9444389791664403; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; fire=2.1972245773362196; 
reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; rt=1.9459101490553132; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; fire=2.1972245773362196; reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; aya=5.8888779583328805; penghianat=2.9444389791664403; it=2.5649493574615367; maybe=2.9444389791664403; ke=2.9444389791664403; yaps=2.9444389791664403; better=2.9444389791664403; beralih=2.9444389791664403; than=2.9444389791664403; lancar=2.9444389791664403; saya=1.791759469228055; tidur=2.9444389791664403; giliran=2.9444389791664403; teman=5.1298987149230735; terimakasih=2.9444389791664403; pada=2.9444389791664403; connection=2.9444389791664403; it=2.5649493574615367; slow=2.9444389791664403; like=2.9444389791664403; internet=1.3862943611198906; telkomsel=2.5649493574615367; operator=2.1972245773362196; pakai=1.0986122886681098; tetap=2.5649493574615367; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; internetan=1.791759469228055; cepat=1.9459101490553132; 5g=2.1972245773362196; rt=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internet=1.3862943611198906; tanpa=1.9459101490553132; putus=1.9459101490553132; jaringan=2.1972245773362196; call=1.9459101490553132; they=3.58351893845611; have=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; 
fire=2.1972245773362196; reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; insurance=2.1972245773362196; they=1.791759469228055; call=1.9459101490553132; but=2.1972245773362196; single=2.1972245773362196; can=1.9459101490553132; have=1.9459101490553132; so=1.9459101490553132; good=1.9459101490553132; cute=1.9459101490553132; fire=2.1972245773362196; reasonable=2.1972245773362196; price=2.1972245773362196; even=2.1972245773362196; pakai=1.0986122886681098; ads=2.1972245773362196; hope=2.1972245773362196; lancar=2.9444389791664403; saya=1.791759469228055; tidur=2.9444389791664403; giliran=2.9444389791664403; teman=5.1298987149230735; terimakasih=2.9444389791664403; pada=2.9444389791664403; cepat=1.9459101490553132; 5g=2.1972245773362196; broom=1.791759469228055; dengan=1.9459101490553132; didukung=1.9459101490553132; bisa=1.9459101490553132; pakai=1.0986122886681098; internetan=1.791759469228055; internet=1.3862943611198906; putus=1.9459101490553132; tanpa=1.9459101490553132; jaringan=2.1972245773362196; slow=2.9444389791664403; penghianat=2.9444389791664403; it=2.5649493574615367; maybe=2.9444389791664403; ke=2.9444389791664403; yaps=2.9444389791664403; better=2.9444389791664403; beralih=2.9444389791664403; than=2.9444389791664403; so=3.8918202981106265; bad=2.9444389791664403; saya=1.791759469228055; tetap=2.5649493574615367; iklan=2.9444389791664403; haha=1.791759469228055; tetap=2.5649493574615367; rt=1.9459101490553132; teman=5.1298987149230735; pakai=1.0986122886681098; haha=1.791759469228055; operator=2.1972245773362196; iklan=2.9444389791664403; saya=1.791759469228055; haha=1.791759469228055; can=1.9459101490553132; aya=2.9444389791664403; they=1.791759469228055; like=2.9444389791664403; broom=1.791759469228055; internetan=1.791759469228055; pakai=1.0986122886681098; internet=1.3862943611198906; |
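The output above is a flat run of `term=value; ` pairs, with the pairs of all tweets written back to back. A minimal sketch of reading that format back and averaging the values per term, which appears to be what the statistics file reports; the class and method names are hypothetical and not part of the gist:

```java
import java.util.Map;
import java.util.TreeMap;

class TfidfOutputReader {
    /** Parses "term=value; term=value; ..." and returns the mean value per term. */
    static Map<String, Double> meanPerTerm(String content) {
        Map<String, double[]> acc = new TreeMap<>(); // term -> {sum, count}
        for (String pair : content.split(";")) {
            String p = pair.trim();
            int eq = p.lastIndexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed fragments
            }
            String term = p.substring(0, eq);
            double value = Double.parseDouble(p.substring(eq + 1));
            double[] sumCount = acc.computeIfAbsent(term, k -> new double[2]);
            sumCount[0] += value;
            sumCount[1] += 1;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        String sample = "berpengalaman=5.8888779583328805; operator=2.1972245773362196; "
                + "berpengalaman=2.9444389791664403; ";
        System.out.println(meanPerTerm(sample));
    }
}
```

For `berpengalaman`, which appears once with 5.8888779583328805 and once with 2.9444389791664403, the mean is about 4.4167, in line with the value in the statistics listing.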
pakai=1.0986122886681096, df=12
internet=1.3862943611198906, df=8
broom=1.7917594692280547, df=6
haha=1.7917594692280547, df=6
saya=1.7917594692280547, df=6
bisa=1.945910149055313, df=5
call=1.945910149055313, df=5
can=1.945910149055313, df=5
cepat=1.945910149055313, df=5
cute=1.945910149055313, df=5
dengan=1.945910149055313, df=5
didukung=1.945910149055313, df=5
good=1.945910149055313, df=5
have=1.945910149055313, df=5
putus=1.945910149055313, df=5
rt=1.945910149055313, df=5
tanpa=1.945910149055313, df=5
they=2.0903860474327307, df=6
5g=2.1972245773362196, df=4
ads=2.1972245773362196, df=4
but=2.1972245773362196, df=4
even=2.1972245773362196, df=4
fire=2.1972245773362196, df=4
hope=2.1972245773362196, df=4
insurance=2.1972245773362196, df=4
jaringan=2.1972245773362196, df=4
operator=2.1972245773362196, df=4
price=2.1972245773362196, df=4
reasonable=2.1972245773362196, df=4
single=2.1972245773362196, df=4
so=2.335092178866376, df=5
internetan=2.3890126256374065, df=6
it=2.5649493574615367, df=3
telkomsel=2.5649493574615367, df=3
tetap=2.5649493574615367, df=3
yang=2.5649493574615367, df=3
bad=2.9444389791664403, df=2
beralih=2.9444389791664403, df=2
better=2.9444389791664403, df=2
connection=2.9444389791664403, df=2
dijamin=2.9444389791664403, df=2
giliran=2.9444389791664403, df=2
iklan=2.9444389791664403, df=2
ke=2.9444389791664403, df=2
lancar=2.9444389791664403, df=2
like=2.9444389791664403, df=2
marketing=2.9444389791664403, df=2
maybe=2.9444389791664403, df=2
oprator=2.9444389791664403, df=2
pada=2.9444389791664403, df=2
penghianat=2.9444389791664403, df=2
selain=2.9444389791664403, df=2
slow=2.9444389791664403, df=2
terimakasih=2.9444389791664403, df=2
than=2.9444389791664403, df=2
tidur=2.9444389791664403, df=2
wusshhhhh=2.9444389791664403, df=2
yaps=2.9444389791664403, df=2
aya=4.41665846874966, df=2
berpengalaman=4.41665846874966, df=2
teman=5.1298987149230735, df=3
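The weights in this listing can be cross-checked by hand. Assuming N = 39 documents (the number of tweet lines in the sample input), the value for `pakai` (df=12) equals ln(3) ≈ 1.0986, which is what ln(N/df) gives when 39/12 is truncated to the integer 3; a real-valued ratio would give ln(3.25) ≈ 1.1787 instead. The sketch below contrasts the two; the class and method names are hypothetical and only serve the comparison:

```java
class IdfCheck {
    /** IDF with the ratio truncated to an integer before the logarithm. */
    static double idfTruncated(int totalDocs, int df) {
        return Math.log(totalDocs / df); // int / int discards the fractional part
    }

    /** IDF with a real-valued ratio. */
    static double idfReal(int totalDocs, int df) {
        return Math.log((double) totalDocs / df);
    }

    public static void main(String[] args) {
        int totalDocs = 39; // assumption: one document per line of the sample input
        int[] dfs = {12, 8, 6, 5, 4, 3, 2};
        for (int df : dfs) {
            System.out.println("df=" + df
                    + " truncated=" + idfTruncated(totalDocs, df)
                    + " real=" + idfReal(totalDocs, df));
        }
    }
}
```

Terms whose listed value is larger than the truncated IDF for their df (for example `they`, `so`, `internetan`, `teman`) occur more than once in at least one tweet, so their average includes a tf greater than 1.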
package dataConvert;

import java.io.*;
import java.util.*;
import java.util.Map.Entry;

/**
 * Program to compute TF-IDF.
 *
 * @author frendhisaidodanaro
 */
public class procTFIDF {
    // List used to check whether a term is a stop word.
    private ArrayList<String> alExtStopWords = new ArrayList<String>();

    // Returns the entries of a map sorted by value, in ascending order.
    static <K,V extends Comparable<? super V>> SortedSet<Map.Entry<K,V>> entriesSortedByValues(Map<K,V> map) {
        SortedSet<Map.Entry<K,V>> sortedEntries = new TreeSet<Map.Entry<K,V>>(
            new Comparator<Map.Entry<K,V>>() {
                @Override public int compare(Map.Entry<K,V> e1, Map.Entry<K,V> e2) {
                    int res = e1.getValue().compareTo(e2.getValue());
                    // Never report equality, so entries with equal values are all kept in the set.
                    return res != 0 ? res : 1;
                }
            }
        );
        sortedEntries.addAll(map.entrySet());
        return sortedEntries;
    }
    // Snippet from the program edu.upi.cs.tweetmining.TFIDF that loads the stop words into the list alExtStopWords.
    private void loadExtStopWords(String inputExtStopWords) {
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(new FileReader(inputExtStopWords))) {
            String strLine;
            while ((strLine = br.readLine()) != null) {
                alExtStopWords.add(strLine);
            }
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
    public void process(String fileInput, String extStopWord, boolean denganStat) {
        // Base name for the output files: the input name without its extension.
        int dot = fileInput.lastIndexOf('.');
        String namaFile = (dot >= 0) ? fileInput.substring(0, dot) : fileInput;
        int totalTerms = 0;
        int totalDoc;
        // Load the stop words into alExtStopWords.
        loadExtStopWords(extStopWord);

        ArrayList<HashMap<String, Integer>> arrTweets = new ArrayList<HashMap<String, Integer>>();
        ArrayList<HashMap<String, Double>> arrTFIDF = new ArrayList<HashMap<String, Double>>();
        HashMap<String, Integer> docFreq = new HashMap<String, Integer>();
        TreeMap<String, Double> tfIDF = new TreeMap<String, Double>();
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileInput)));
            System.out.println("Reading " + fileInput);
            // COMPUTE TERM FREQUENCY
            // Read the input file and count the tf of each term per line.
            String strLine;
            Integer tfreq;
            while ((strLine = br.readLine()) != null) {
                // Skip lines too short to hold the 22-character "yyyy-MM-ddTHH:mm:ssZ||" prefix.
                if (strLine.length() < 22) {
                    continue;
                }
                HashMap<String, Integer> termFreq = new HashMap<String, Integer>();
                String docn = strLine.substring(22);
                Scanner sc = new Scanner(docn);
                while (sc.hasNext()) {
                    String term = sc.next();
                    if (!term.equalsIgnoreCase("indosat")) { // Skip the keyword "indosat" because it occurs in every tweet.
                        tfreq = termFreq.get(term); // Current count, or null if the term is new.
                        termFreq.put(term, (tfreq == null) ? 1 : tfreq + 1); // Start at 1 for a new term, otherwise increment.
                        totalTerms++;
                    }
                }
                sc.close();
                arrTweets.add(termFreq); // Store this tweet's termFreq.
            }
            br.close();
            // Finished reading the dataset.
            // arrTweets holds one termFreq HashMap per document/tweet, mapping each term to its tf.
            // COMPUTE DOCUMENT FREQUENCY
            // Iterate over arrTweets to count the number of documents that contain each term.
            // For example, docFreq.put("awan", 7) means the term "awan" was found in 7 documents/tweets.
            for (HashMap<String, Integer> perTweet : arrTweets) {
                for (String eachW : perTweet.keySet()) {
                    if (alExtStopWords.contains(eachW)) { // Stop words get DF = 0.
                        docFreq.put(eachW, 0);
                    } else {
                        Integer dfreq = docFreq.get(eachW);
                        docFreq.put(eachW, (dfreq == null) ? 1 : dfreq + 1);
                    }
                }
            }
            // Finished computing the DF of each term.
            // The HashMap docFreq maps key = term to value = document frequency.
            // COMPUTE IDF AND TF-IDF
            // Iterate over arrTweets once more to compute the IDF and, in the same pass, TF*IDF.
            // For each document, the TF*IDF of every term is stored in the HashMap valTFIDF,
            // and each valTFIDF is then collected in arrTFIDF.
            Double idf, tfidf;
            totalDoc = arrTweets.size();
            for (HashMap<String, Integer> perTweet : arrTweets) {
                HashMap<String, Double> valTFIDF = new HashMap<String, Double>();
                for (String aTerm : perTweet.keySet()) { // term to process
                    Integer dfreq = docFreq.get(aTerm); // DF of the term being processed
                    if (dfreq > 1) { // skips stop words (DF = 0) and terms found in only one document
                        Integer cfreq = perTweet.get(aTerm); // tf of aTerm in this tweet
                        // Cast to double: int / int would truncate the ratio before the logarithm.
                        idf = Math.log((double) totalDoc / dfreq);
                        tfidf = cfreq * idf;
                        valTFIDF.put(aTerm, tfidf);
                        //System.out.println("TFIDF("+aTerm+")= "+cfreq+" * "+"log("+totalDoc+"/"+dfreq+") = "+ tfidf+" , ");
                    }
                }
                arrTFIDF.add(valTFIDF); // Done with one tweet; store its valTFIDF in arrTFIDF.
            }
            // Finished computing IDF and TF*IDF.
            // arrTFIDF holds one valTFIDF per document, with the tf-idf value of each term.
            // Write the TF*IDF results to the output file namafile_tfidf.txt (opened in append mode).
            BufferedWriter writeTFIDF = new BufferedWriter(new FileWriter((namaFile + "_tfidf.txt"), true));
            for (HashMap<String, Double> perTweet : arrTFIDF) {
                //System.out.println(perTweet.toString());
                for (String aTerm : perTweet.keySet()) {
                    Double valTFIDF = perTweet.get(aTerm);
                    writeTFIDF.write(aTerm + "=" + valTFIDF + "; ");
                    //System.out.print(aTerm+"="+valTFIDF+"; ");
                }
                //System.out.println("__");
                //writeTFIDF.newLine();
            }
            writeTFIDF.close();
            // Compute the average TF-IDF weight of each term if denganStat == true.
            if (denganStat) {
                // COMPUTE the average TF-IDF of each term
                for (String word : docFreq.keySet()) {
                    Integer dfreq = docFreq.get(word);
                    if (dfreq > 1) { // only terms that occur in more than one document
                        //System.out.println("Collecting term: "+word+" df= "+dfreq);
                        Double tfIDFstat = 0.0; // accumulator for the term's tf-idf values
                        int cc = 0; // number of documents that contain the term
                        for (HashMap<String, Double> val : arrTFIDF) {
                            Double v = val.get(word);
                            if (v != null) {
                                cc++;
                                tfIDFstat = tfIDFstat + v; // accumulate the term's tf-idf over all documents
                            }
                        }
                        //System.out.println("Counted="+cc+" tfIDFstats="+tfIDFstat);
                        Double tfIDFtot = tfIDFstat / cc; // COMPUTE THE AVERAGE
                        //System.out.println("tfidf("+word+")="+tfIDFtot);
                        tfIDF.put(word, tfIDFtot); // Store in the TreeMap tfIDF.
                    }
                }
                // Write the averages to the output file namafile_tfidf_stat.txt (opened in append mode).
                BufferedWriter writeStat = new BufferedWriter(new FileWriter((namaFile + "_tfidf_stat.txt"), true));
                for (Entry<String, Double> entry : entriesSortedByValues(tfIDF)) {
                    String oneWord = entry.getKey();
                    Double oneValue = entry.getValue();
                    Integer dfreq = docFreq.get(oneWord);
                    //System.out.println("tdidf("+oneWord+")= "+oneValue);
                    writeStat.write(oneWord + "=" + oneValue + ", df=" + dfreq);
                    writeStat.newLine();
                }
                writeStat.close();
            }
        } catch (Exception e) {
            System.out.println(e.toString());
        }
        System.out.println("Unique terms: " + docFreq.size());
        System.out.println("Number of documents: " + arrTweets.size());
        System.out.println("Total terms: " + totalTerms);
    }

    public static void main(String[] a) {
        procTFIDF pt = new procTFIDF();
        pt.process("negatif_2012.txt", "catatan_stopwords_ekstensif.txt", true);
    }
}
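The program above works in three passes: term frequency per tweet, document frequency per term, then tf * idf per term and tweet. A compact sketch of the same three passes on an in-memory corpus may make the flow easier to follow. This is not the gist's code: the class and method names are hypothetical, it uses a real-valued N/df ratio, and it leaves out the stop-word handling and the df > 1 filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TfIdfSketch {
    /** Returns one map per document with weight = tf * ln(N / df) for each term. */
    static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        // Pass 1: term frequency per document.
        List<Map<String, Integer>> tfs = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            tfs.add(tf);
        }
        // Pass 2: document frequency, i.e. the number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> tf : tfs) {
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        // Pass 3: tf * idf for every term of every document.
        int n = docs.size();
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> tf : tfs) {
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("internet", "cepat", "internet"),
                Arrays.asList("cepat", "lancar"));
        System.out.println(tfidf(docs));
    }
}
```

In this tiny corpus a term that occurs in every document gets weight 0, which is the same reason the program skips the keyword `indosat` up front.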
That part of the code only serves to fill the stop-word list (ArrayList<String> alExtStopWords) from a file. If the stop words are simply hard-coded, that part of the code may not be needed.
The stop-word list is later used when computing DF, where the document frequency of stop words is not counted:
https://gist.github.com/frendhisaido/3170455#file-proctfidf-java-L109
Sorry, this was 8 years ago so I don't remember the details exactly,
but as far as I recall, stop words don't need to be counted for TF-IDF because they (probably) carry no sentiment value.
So, to keep them from having much influence on the classification, stop words are skipped.
Reference from my lecturer's blog: https://yudiwbs.wordpress.com/2008/07/23/stop-words-untuk-bahasa-indonesia/
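
As a rough illustration of the hard-coded alternative mentioned above, the stop words could live in a constant set and be filtered out before counting. The class name, method name, and the handful of words below are only illustrative assumptions; the real list is the external stop-word file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HardcodedStopWords {
    // Illustrative entries only, not the full list from the stop-word file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("yang", "dengan", "di", "ke", "pada"));

    /** Returns the terms that are not stop words, in their original order. */
    static List<String> withoutStopWords(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!STOP_WORDS.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [sinyal, kuat]
        System.out.println(withoutStopWords(Arrays.asList("sinyal", "yang", "kuat")));
    }
}
```

A `HashSet` gives constant-time lookups, whereas `contains` on an `ArrayList` scans the whole list for every term.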