@bxjx
Created October 16, 2013 01:49
Example of using gramophone ngrams as input into Natural's TF-IDF function. It's adapted from the example in the Natural docs. NOTE that all arguments to the tfidf methods must be arrays. If they are strings, the tfidf object will run the default tokenizer over the argument. I might submit a pull request to natural so that `natural.TfIdf` could …
// Feed gramophone n-grams into natural's TF-IDF (adapted from the natural docs).
var natural = require('natural'),
    TfIdf = natural.TfIdf,
    tfidf = new TfIdf();
var gramophone = require('gramophone');

var docs = [
  'this document is about node programming language.',
  'this document is about ruby programming language.',
  'this document is about the ruby programming language and node programming language.',
  'this document is about node programming language. it has node programming language examples'
];

// Add each document as an array of n-grams so TfIdf does not run its default tokenizer.
docs.forEach(function(doc, index) {
  var ngrams = gramophone.extract(doc, { min: 1, flatten: true });
  console.error('ngrams for doc ' + index + ':');
  console.error(ngrams);
  tfidf.addDocument(ngrams);
});

// Query terms are passed as arrays too, so each n-gram is treated as a single term.
console.log('node programming language -----------');
tfidf.tfidfs(['node programming language'], function(i, measure) {
  console.log('document #' + i + ' is ' + measure);
});

console.log('"document" --------------------------------');
tfidf.tfidfs(['document'], function(i, measure) {
  console.log('document #' + i + ' is ' + measure);
});
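
The array-versus-string distinction from the description can be illustrated without either library. The sketch below is only an approximation under stated assumptions: `toTerms` and `countMatches` are hypothetical helpers, and the regex split is a stand-in for a default word tokenizer, not natural's actual implementation. The term list is copied from the n-grams reported for doc 3 in the output further down.

```javascript
// Hypothetical stand-in for a default word tokenizer (assumption: natural's
// real tokenizer may behave differently).
function toTerms(input) {
  if (Array.isArray(input)) {
    return input; // arrays are used as-is: each element is one term
  }
  return input.toLowerCase().split(/\W+/).filter(Boolean);
}

// Count how often each query term occurs in a document's term list.
function countMatches(query, docTerms) {
  return toTerms(query).map(function (term) {
    var count = docTerms.filter(function (t) { return t === term; }).length;
    return { term: term, count: count };
  });
}

// N-grams reported for doc 3 in the pasted output.
var doc3 = [
  'node programming language',
  'node programming language',
  'programming language examples',
  'document'
];

// Array query: the whole phrase is one term and matches the stored n-gram.
var asArray = countMatches(['node programming language'], doc3);

// String query: the phrase is split into single words, none of which
// equals a stored n-gram.
var asString = countMatches('node programming language', doc3);

console.log(asArray);  // one entry, count 2
console.log(asString); // three entries, each with count 0
```

Under these assumptions, a phrase passed as a string never matches a phrase-level n-gram, which is why both the documents and the query terms are passed as arrays above.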
@mef

mef commented Oct 17, 2013

Running this produces the following output:

ngrams for doc 0:
[ 'node programming language', 'document' ]
ngrams for doc 1:
[ 'ruby programming language', 'document' ]
ngrams for doc 2:
[ 'ruby programming language',
  'node programming language',
  'document' ]
ngrams for doc 3:
[ 'node programming language',
  'node programming language',
  'programming language examples',
  'document' ]
node programming language -----------
document #0 is 0
document #1 is 0
document #2 is 0
document #3 is 0
"document" --------------------------------
document #0 is -0.1823215567939546
document #1 is -0.1823215567939546
document #2 is -0.1823215567939546
document #3 is -0.1823215567939546

modules:

  • gramophone 0.0.3
  • natural 0.1.23

Am I missing something?

Edit: I get the same problem when running only natural's tf-idf example, so the issue has nothing to do with gramophone.
