Created
August 3, 2011 14:36
Learning a Labeled LDA model using the Stanford Topic Modeling Toolbox
// Stanford TMT Example 6 - Training a LabeledLDA model
// http://nlp.stanford.edu/software/tmt/0.3/

// tells Scala where to find the TMT classes
import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import scalanlp.pipes.Pipes.global._;

import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;
import edu.stanford.nlp.tmt.model.llda._;

val source = CSVFile("examplefile.csv") ~> IDColumn(1);

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  CaseFolder() ~>                        // lowercase everything
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(2) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(2) ~>   // filter terms appearing in <2 docs
  TermDynamicStopListFilter(5) ~>        // filter out the 5 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// define fields from the dataset we are going to slice against
val labels = {
  source ~>                              // read from the source file
  Column(3) ~>                           // select column three, containing the labels
  TokenizeWith(WhitespaceTokenizer()) ~> // turns the label field into an array
  TermCounter() ~>                       // collect label counts
  TermMinimumDocumentCountFilter(1)      // filter labels in <1 docs (i.e., keep all)
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset=dataset);

// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);
//val modelPath = file("llda-gibbs-"+dataset.signature+"-"+modelParams.signature);

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 2000);
// or could use
//TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
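A note on multiple labels per document (asked below): since the labels column is run through `WhitespaceTokenizer()`, a single CSV cell like `"sports politics"` already yields two labels for that document, so no code change is needed. A minimal standalone sketch of that behavior (not part of the TMT API; the `MultiLabelExample` object and `parse` helper are illustrative names):

```scala
// Illustrative sketch: mimics what WhitespaceTokenizer does to the label field.
// A document's label cell containing several whitespace-separated tokens
// becomes an array of labels, one per token.
object MultiLabelExample {
  // split a label cell on runs of whitespace, dropping surrounding blanks
  def parse(labelField: String): List[String] =
    labelField.trim.split("\\s+").toList

  def main(args: Array[String]): Unit = {
    println(parse("sports politics")) // List(sports, politics)
    println(parse("economy"))         // List(economy)
  }
}
```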
Hey, this is my first time with Scala. If I have multiple labels for each document, how should I specify that in the code above? I have around 1,000 labels, and each document is labeled with at least one of them.
Is there a Python wrapper for this program?
Also, I don't have any experience with the Stanford Topic Modeling Toolbox. I have 1M documents with labels. Can you tell me whether that is feasible with 4 GB of RAM?
Your help will be highly appreciated.