Simple example of classifying text in R with machine learning (the tm text-mining library, caret, and a Bayesian generalized linear model). Keywords: classify, tfidf, tdm, term document matrix.
# caret's 'bayesglm' method needs the arm package installed; tm's stemming needs SnowballC.
library(caret)
library(tm)
# Training data.
data <- c('Cats like to chase mice.', 'Dogs like to eat big bones.')
corpus <- VCorpus(VectorSource(data))
# Create a document term matrix.
tdm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
# Convert to a data.frame for training and assign a classification (factor) to each document.
train <- as.matrix(tdm)
train <- cbind(train, c(0, 1))
colnames(train)[ncol(train)] <- 'y'
train <- as.data.frame(train)
train$y <- as.factor(train$y)
# Train.
fit <- train(y ~ ., data = train, method = 'bayesglm')
# Predict on the training data (a sanity check, not a held-out accuracy estimate).
predict(fit, newdata = train)
# Test data.
data2 <- c('Bats eat bugs.')
corpus <- VCorpus(VectorSource(data2))
tdm <- DocumentTermMatrix(corpus, control = list(dictionary = Terms(tdm), removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
test <- as.matrix(tdm)
# Predict the class of the test sentence (it has no true label, so no accuracy is computed here).
predict(fit, newdata = test)
> data
[1] "Cats like to chase mice." "Dogs like to eat big bones."
> train
  big bone cat chase dog eat like mice y
1   0    0   1     1   0   0    1    1 0
2   1    1   0     0   1   1    1    0 1
> predict(fit, newdata = train)
[1] 0 1
> data2
[1] "Bats eat bugs."
> test
  big bone cat chase dog eat like mice
1   0    0   0     0   0   1    0    0
> predict(fit, newdata = test)
[1] 1

gsaray101 commented Aug 2, 2017

When you predict on the test data, you get 1. What does this mean? Can you briefly explain it?


josvaler commented Sep 2, 2017

There's a problem in the training step:

# Train.
fit <- train(y ~ ., data = train, method = 'bayesglm')

It fails with this output:

Error in model.frame.default(form = y ~ ., data = train, na.action = na.fail) :
  invalid type (list) for variable 'y'

Owner Author

primaryobjects commented Sep 26, 2017

@gsaray101 The 1 indicates the y value. In this case, it represents "eating". A 0 would represent "not eating".

The example above has a training set of 2 records, with the y-value indicating whether the sentence is about eating or not. So, when we run the model on the test sentence, we get a 1. :)
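
To make this concrete: the class labels come only from the vector bound to the training matrix (`c(0, 1)` in the gist), not from any particular word in the text. Below is a minimal base-R sketch, assuming the term counts typed in by hand from the `train` output above. The name `train_df` and the label names `not_eating` and `eating` are illustrative and do not appear in the gist code.

```r
# Term counts copied by hand from the `train` output in the gist.
train_df <- data.frame(
  big = c(0, 1), bone = c(0, 1), cat = c(1, 0), chase = c(1, 0),
  dog = c(0, 1), eat = c(0, 1), like = c(1, 1), mice = c(1, 0)
)

# The labels are chosen by whoever prepares the data, one per document.
# "not_eating" and "eating" are illustrative names, not from the gist.
train_df$y <- factor(c(0, 1), levels = c(0, 1),
                     labels = c("not_eating", "eating"))

# Inspect the label column: a factor with one level per class.
str(train_df$y)
table(train_df$y)
```

Relabeling the same two sentences (for example as "sleeping" and "not sleeping") would only change what the predicted value is taken to mean; the model learns whatever mapping the labels define.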

Owner Author

primaryobjects commented Sep 26, 2017

@josvaler Be sure to copy the code exactly as shown above. In particular, note that the y column must be a factor. I ran the above code successfully on R 3.3.3.

> train
  big bone cat chase dog eat like mice y
1   0    0   1     1   0   0    1    1 0
2   1    1   0     0   1   1    1    0 1
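
One quick way to check the column types before calling `train()` is sketched below. It uses a cut-down, hand-typed data frame (`train_df` is a name used here only for illustration) and does not reproduce the reported error, whose cause this thread does not establish.

```r
# Cut-down stand-in for the gist's training data (two term columns only).
train_df <- data.frame(eat = c(0, 1), like = c(1, 1))
train_df$y <- as.factor(c(0, 1))

# Each column should be an atomic vector; y should be a factor, not a list.
col_classes <- vapply(train_df, function(col) class(col)[1], character(1))
print(col_classes)
stopifnot(is.factor(train_df$y), !is.list(train_df$y))
```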

bcafferky commented Apr 12, 2018

Nice example - just enough to cover the concepts. I did find I had to install additional packages not listed, i.e.

library('SnowballC')
library('minqa')
library('e1071')
library('caret')
library('tm')

Thanks
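
A small sketch for checking which of these packages are available before running the gist. The list below is taken from this comment, with `arm` added as an assumption because caret's 'bayesglm' method is implemented in that package. The install line is left commented out.

```r
# Packages named in this thread, plus arm (an addition made here, see above).
pkgs <- c("caret", "tm", "SnowballC", "minqa", "e1071", "arm")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```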


CallMe-Sri commented May 8, 2018

The 1 indicates the y value. In this case, it represents "eating". A 0 would represent "not eating".

The example above has a training set of 2 records, with the y-value indicating whether the sentence is about eating or not. So, when we run the model on the test sentence, we get a 1.

My question is: where in the code do you specify that 'eat' is the word used to predict 0 or 1? If I would like to add another word, such as "like", where should I make the changes? Please explain.


chintamanand commented Jun 22, 2018

How did you decide that y (the dependent variable) stands for the "eating" and "not eating" classes?
Why can't I treat y as "sleeping" and "not sleeping" classes?
Does it depend on the terms used in the documents?


rachhitgarg commented Aug 20, 2018

What if our test data has different keywords in the text? Can we still classify the text?

Suppose the test data is "dogs are mostly of brown colour".

It shows this error:

dims of 'test' and 'train' differ
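
The gist handles this by passing `dictionary = Terms(tdm)` when building the test matrix, so the test columns are restricted to the training vocabulary and unseen words are dropped. The same idea can be sketched in base R, without tm. The token vector below is a hypothetical, hand-preprocessed version of the sentence (the exact stems tm would produce may differ), and `train_terms`, `test_tokens` and `test_df` are names chosen here for illustration.

```r
# Vocabulary learned from the training documents (from the gist's output).
train_terms <- c("big", "bone", "cat", "chase", "dog", "eat", "like", "mice")

# Hypothetical tokens for "dogs are mostly of brown colour" after
# lower-casing, stop-word removal and stemming (done by hand here).
test_tokens <- c("dog", "most", "brown", "colour")

# Count only the training terms; tokens outside the vocabulary are ignored.
counts <- vapply(train_terms,
                 function(term) sum(test_tokens == term),
                 numeric(1))

# One-row data frame with exactly the same columns as the training features.
test_df <- as.data.frame(as.list(counts))
print(test_df)
```

Because `test_df` has the same columns as the training features, a model fitted on those features can score it; only "dog" contributes, since the other words were never seen in training.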


seymakalay commented Feb 6, 2019

Hello, thank you very much for sharing this, but I believe
predict(fit, newdata = train) should be run on the test set rather than the training set, as this link suggests: https://cran.r-project.org/web/packages/caret/vignettes/caret.html
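
Measuring accuracy on held-out data requires true labels for the test documents, which the gist's single test sentence does not have. A minimal sketch of the comparison follows, using made-up label vectors (`truth` and `pred` are hypothetical and are not results from the gist's model).

```r
# Hypothetical true labels and predictions for four held-out documents.
truth <- factor(c(1, 0, 1, 1), levels = c(0, 1))
pred  <- factor(c(1, 0, 0, 1), levels = c(0, 1))

# Accuracy is the share of held-out documents predicted correctly.
accuracy <- mean(pred == truth)
print(accuracy)

# Cross-tabulate predictions against the true labels.
print(table(predicted = pred, actual = truth))
```

With a real model, `pred` would come from `predict(fit, newdata = ...)` on labelled test data; caret's `confusionMatrix()` reports the same comparison in more detail.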
