@guenodz
Created February 17, 2015 11:35
A simple implementation of the TF-IDF algorithm in Java.
package com.guendouz.textclustering.preprocessing;

import java.util.Arrays;
import java.util.List;

/**
 * @author Mohamed Guendouz
 */
public class TFIDFCalculator {

    /**
     * @param doc  list of strings
     * @param term String represents a term
     * @return term frequency of term in document
     */
    public double tf(List<String> doc, String term) {
        // Count case-insensitive occurrences of the term in the document.
        double result = 0;
        for (String word : doc) {
            if (term.equalsIgnoreCase(word))
                result++;
        }
        return result / doc.size();
    }

    /**
     * @param docs list of list of strings represents the dataset
     * @param term String represents a term
     * @return the inverse document frequency of term in docs (natural
     *         logarithm); positive infinity if no document contains the term
     */
    public double idf(List<List<String>> docs, String term) {
        // n is the number of documents that contain the term at least once.
        double n = 0;
        for (List<String> doc : docs) {
            for (String word : doc) {
                if (term.equalsIgnoreCase(word)) {
                    n++;
                    break;
                }
            }
        }
        // n is a double, so this is floating-point division.
        return Math.log(docs.size() / n);
    }

    /**
     * @param doc  a text document
     * @param docs all documents
     * @param term term
     * @return the TF-IDF of term
     */
    public double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        return tf(doc, term) * idf(docs, term);
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("Lorem", "ipsum", "dolor", "ipsum", "sit", "ipsum");
        List<String> doc2 = Arrays.asList("Vituperata", "incorrupte", "at", "ipsum", "pro", "quo");
        List<String> doc3 = Arrays.asList("Has", "persius", "disputationi", "id", "simul");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3);

        TFIDFCalculator calculator = new TFIDFCalculator();
        double tfidf = calculator.tfIdf(doc1, documents, "ipsum");
        System.out.println("TF-IDF (ipsum) = " + tfidf);
    }
}
@remipichon

Thanks mate, couldn't figure out how to compute this efficiently enough!

@sadrizahir

Thanks. Just starting on this topic and this is a great help.

@K75

K75 commented May 22, 2017

C:\Users\M\Desktop>javac TFIDFCalculator.java
C:\Users\M\Desktop>java TFIDFCalculator
Error: Could not find or load main class TFIDFCalculator
Sir, I am getting this error. Can you help me sort it out?

@jottinger

K75: It sounds like you have %CLASSPATH% defined, given no other information. Try:

javac TFIDFCalculator.java
java -cp . TFIDFCalculator

@yuunix

yuunix commented Jan 2, 2018

In the idf method you should handle the case where "term" doesn't exist in any document, because you can't divide by 0.
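
One way to handle that case is sketched below. It is not the gist author's code: the class name SafeIdfSketch and the choice to return 0.0 for a term that appears in no document are assumptions made for illustration. Other conventions, such as adding 1 to the denominator, are also common.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical variant of the gist's idf method with a guard for unseen terms.
class SafeIdfSketch {

    static double idf(List<List<String>> docs, String term) {
        int n = 0; // number of documents that contain the term
        for (List<String> doc : docs) {
            for (String word : doc) {
                if (term.equalsIgnoreCase(word)) {
                    n++;
                    break; // count each document at most once
                }
            }
        }
        if (n == 0) {
            // Assumption: a term found in no document gets weight 0.0
            // instead of the infinite value the unguarded division yields.
            return 0.0;
        }
        return Math.log((double) docs.size() / n);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Lorem", "ipsum", "dolor"),
                Arrays.asList("Vituperata", "ipsum"),
                Arrays.asList("Has", "persius"));
        System.out.println("idf(ipsum)   = " + idf(docs, "ipsum"));
        System.out.println("idf(missing) = " + idf(docs, "missing")); // prints 0.0
    }
}
```

With this guard, tf * idf for an unseen term is 0.0 rather than NaN.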

@abhijithanil

return Math.log(docs.size() / n);
Cast it:
return Math.log((double) docs.size() / (double) n);
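
Whether the cast changes anything depends on the type of n. In the gist as posted, n is declared as a double, so the division is already floating-point. The cast would matter if n were an int. The sketch below compares the three cases; the class and method names are hypothetical.

```java
// Illustrative comparison of integer and floating-point division in Java.
class DivisionSketch {

    // Same shape as the gist: int divided by double is floating-point division.
    static double ratioWithDoubleCount(int total, double n) {
        return total / n;
    }

    // If the count were an int, the division truncates before widening to double.
    static double ratioWithIntCount(int total, int n) {
        return total / n;
    }

    // Casting one operand forces floating-point division.
    static double ratioWithCast(int total, int n) {
        return (double) total / n;
    }

    public static void main(String[] args) {
        System.out.println(ratioWithDoubleCount(3, 2.0)); // prints 1.5
        System.out.println(ratioWithIntCount(3, 2));      // prints 1.0
        System.out.println(ratioWithCast(3, 2));          // prints 1.5
    }
}
```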

@JoaoCoimbra2018

Everything all right.

@lq920320

lq920320 commented Jun 5, 2018

Thanks.

@modelarious

This is a great demo, thanks for putting this up

@Mohamed-El

Thank you, just getting into text mining and this is very helpful.

@qmHecker

Hi,

I'm currently looking into TF-IDF for the first time.
One detail question:

In line 39 you are using Math.log, which returns the natural logarithm (base e) (https://docs.oracle.com/javase/7/docs/api/java/lang/Math.html).
The Wikipedia article (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) states that the base 10 logarithm should be used, so shouldn't this be changed to:
Math.log10(docs.size() / n)

Kind regards,
Michael
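
For comparison, the sketch below computes the IDF with both bases. Because ln(x) = log10(x) * ln(10), switching the base multiplies every IDF value by the same constant, so the relative ordering of terms does not change. The class and method names are hypothetical and not part of the gist.

```java
// Illustrative comparison of natural-log and base-10 IDF values.
class LogBaseSketch {

    // IDF using the natural logarithm, as in the gist.
    static double idfNatural(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    // IDF using the base-10 logarithm, as proposed in the comment.
    static double idfBase10(int totalDocs, int docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        double natural = idfNatural(3, 2);
        double base10 = idfBase10(3, 2);
        System.out.println("natural log IDF = " + natural);
        System.out.println("base-10 log IDF = " + base10);
        // The two differ only by the constant factor ln(10).
        System.out.println("base-10 IDF * ln(10) = " + base10 * Math.log(10));
    }
}
```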

@dreamerns

I think you will get an exception at the line return Math.log(docs.size() / n); when n is 0.
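
Since n is declared as a double in the gist, Java performs floating-point division here, which does not throw when the divisor is 0. The small sketch below checks what the expression evaluates to in that case; the class and method names are hypothetical.

```java
// Illustrative check of the gist's idf expression when n is 0.
class ZeroCountSketch {

    // Same expression as the gist's return statement, with n as a double.
    static double idfRaw(int totalDocs, double n) {
        return Math.log(totalDocs / n);
    }

    public static void main(String[] args) {
        double idf = idfRaw(3, 0);   // 3 / 0.0 is Infinity, and log(Infinity) is Infinity
        double tfIdf = 0.0 * idf;    // a tf of 0 times Infinity is NaN
        System.out.println("idf with n = 0   : " + idf);   // prints Infinity
        System.out.println("tf-idf with tf = 0: " + tfIdf); // prints NaN
    }
}
```

No ArithmeticException is raised, but the Infinity or NaN result can silently propagate into later calculations, so a guard for n == 0 is still worth adding.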
