Simple example of classifying text in R with machine learning (the tm text-mining library, caret, and a Bayesian generalized linear model). Keywords: classify, tf-idf, tdm, term-document matrix.
library(caret)
library(tm)
# Training data.
data <- c('Cats like to chase mice.', 'Dogs like to eat big bones.')
corpus <- VCorpus(VectorSource(data))
# Create a document-term matrix (one row per document, one column per term).
tdm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
# Convert to a data.frame for training and assign a class label (factor) to each document: 0 for the first sentence, 1 for the second.
train <- as.matrix(tdm)
train <- cbind(train, c(0, 1))
colnames(train)[ncol(train)] <- 'y'
train <- as.data.frame(train)
train$y <- as.factor(train$y)
# Train.
fit <- train(y ~ ., data = train, method = 'bayesglm')
# Check accuracy on training.
predict(fit, newdata = train)
# Test data.
data2 <- c('Bats eat bugs.')
corpus <- VCorpus(VectorSource(data2))
tdm <- DocumentTermMatrix(corpus, control = list(dictionary = Terms(tdm), removePunctuation = TRUE, stopwords = TRUE, stemming = TRUE, removeNumbers = TRUE))
test <- as.data.frame(as.matrix(tdm))
# Predict the class of the test document.
predict(fit, newdata = test)
> data
[1] "Cats like to chase mice." "Dogs like to eat big bones."
> train
  big bone cat chase dog eat like mice y
1   0    0   1     1   0   0    1    1 0
2   1    1   0     0   1   1    1    0 1
> predict(fit, newdata = train)
[1] 0 1
> data2
[1] "Bats eat bugs."
> test
  big bone cat chase dog eat like mice
1   0    0   0     0   0   1    0    0
> predict(fit, newdata = test)
[1] 1
@gsaray101

commented Aug 2, 2017

When you predict on the test data, you get 1. What does this mean? Can you briefly explain it?

@josvaler

commented Sep 2, 2017

There's a problem in the training step:

# Train.

fit <- train(y ~ ., data = train, method = 'bayesglm')

With this output:

Error in model.frame.default(form = y ~ ., data = train, na.action = na.fail) :
invalid type (list) for variable 'y'

@primaryobjects

Owner Author

commented Sep 26, 2017

@gsaray101 The 1 indicates the y value. In this case, it represents "eating". A 0 would represent "not eating".

The example above has a training set of 2 records, with the y-value indicating whether the sentence is about eating or not. So, when we run the model on the test sentence, we get a 1. :)
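To make the meaning of 0 and 1 explicit, the factor levels can be given names. This is a minimal sketch in base R, assuming the same two training sentences. The label names `not_eating` and `eating` are illustrative choices and are not defined anywhere in the original code.

```r
# Labels in document order: sentence 1 -> 0, sentence 2 -> 1 (as in the code above).
y_raw <- c(0, 1)

# Illustrative label names (an assumption, not part of the original code).
y <- factor(y_raw, levels = c(0, 1), labels = c("not_eating", "eating"))

print(y)
print(levels(y))
```

With named levels, a prediction would come back as a label name rather than a bare 0 or 1, which may be easier to read.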

@primaryobjects

Owner Author

commented Sep 26, 2017

@josvaler Be sure to copy the code exactly as shown above. Specifically, note that the type of the y column is a factor. I ran the above code successfully on R 3.3.3.

> train
  big bone cat chase dog eat like mice y
1   0    0   1     1   0   0    1    1 0
2   1    1   0     0   1   1    1    0 1
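A quick way to check the column types before calling train() might look like the sketch below. It rebuilds the training frame by hand from the term counts printed above. The name `train_df` is hypothetical, chosen here to avoid reusing the name of caret's train() function.

```r
# Rebuild the training frame by hand, using the term counts printed above.
train_df <- data.frame(
  big = c(0, 1), bone = c(0, 1), cat = c(1, 0), chase = c(1, 0),
  dog = c(0, 1), eat = c(0, 1), like = c(1, 1), mice = c(1, 0),
  y = c(0, 1)
)
train_df$y <- as.factor(train_df$y)

# Sanity checks: the object is a data.frame and y is a factor.
stopifnot(is.data.frame(train_df), is.factor(train_df$y))

# Inspect column classes; every term column should be numeric.
col_classes <- sapply(train_df, class)
print(col_classes)
```

If y shows up as anything other than a factor here, that is a likely place to look when the model formula fails.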
@bcafferky

commented Apr 12, 2018

Nice example - just enough to cover the concepts. I did find I had to install additional packages not listed, i.e.

library('SnowballC')
library('minqa')
library('e1071')
library('caret')
library('tm')
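A small helper can report which of these packages are not installed yet. This is only a sketch: `missing_packages` is a hypothetical helper name, and the inclusion of "arm" is an assumption on my part (as far as I know, caret's 'bayesglm' method relies on it), not something stated in this thread.

```r
# Return the subset of `pkgs` that is not installed in the current library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Packages mentioned in this thread; "arm" is an assumption (see above).
needed <- c("caret", "tm", "SnowballC", "minqa", "e1071", "arm")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```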

Thanks

@CallMe-Sri

commented May 8, 2018

The 1 indicates the y value. In this case, it represents "eating". A 0 would represent "not eating".

The example above has a training set of 2 records, with the y-value indicating whether the sentence is about eating or not. So, when we run the model on the test sentence, we get a 1.

My question is: where in the code do you specify that 'eat' is the word that predicts 0 or 1? If I would like to add another word, such as "like", where should I make the changes? Please explain.
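As far as the code shows, no single word is specified anywhere. The labels c(0, 1) are attached to whole documents, and the model is fit on every term column. The sketch below uses only base R and the term counts from the `train` output to show which terms happen to separate the two classes in this tiny training set. The variable names are illustrative.

```r
# Term counts copied from the `train` output above.
terms <- data.frame(
  big = c(0, 1), bone = c(0, 1), cat = c(1, 0), chase = c(1, 0),
  dog = c(0, 1), eat = c(0, 1), like = c(1, 1), mice = c(1, 0)
)
y <- factor(c(0, 1))

# Sum each term's count within each class; rows are classes, columns are terms.
counts_by_class <- rowsum(as.matrix(terms), group = y)
print(counts_by_class)

# Terms that occur only in class-1 documents in this training set.
only_in_1 <- colnames(counts_by_class)[
  counts_by_class["1", ] > 0 & counts_by_class["0", ] == 0
]
print(only_in_1)
```

To teach the model about other words, one would presumably add more sentences to `data` and a matching label for each one in the vector passed to cbind().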

@chintamanand

commented Jun 22, 2018

How can you say that y (the dependent variable) stands for the eating and not-eating classes?
Why can't I treat y as sleeping and not-sleeping classes instead?
Does it depend on the terms used in the documents?

@rachhitgarg

commented Aug 20, 2018

What if our test data has different keywords in the text? Can we still classify it?

Suppose test data = "dogs are mostly of brown colour"

It shows this error:

dims of 'test' and 'train' differ
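One common approach is to force the test matrix onto the training vocabulary: unseen terms are dropped and missing terms are filled with zeros. The original code does something similar through `dictionary = Terms(tdm)`. The sketch below does it by hand in base R. `align_terms` is a hypothetical helper, and the test terms are typed by hand for illustration rather than produced by tm.

```r
# Hypothetical helper: map a test term matrix onto the training vocabulary.
align_terms <- function(test_mat, train_terms) {
  # Start from all zeros, one column per training term.
  aligned <- matrix(0, nrow = nrow(test_mat), ncol = length(train_terms),
                    dimnames = list(rownames(test_mat), train_terms))
  # Copy over counts for terms seen in both; unseen test terms are dropped.
  shared <- intersect(colnames(test_mat), train_terms)
  aligned[, shared] <- test_mat[, shared, drop = FALSE]
  as.data.frame(aligned)
}

train_terms <- c("big", "bone", "cat", "chase", "dog", "eat", "like", "mice")

# Hand-typed illustrative terms for the new sentence (not actual tm output).
test_mat <- matrix(c(1, 1, 1), nrow = 1,
                   dimnames = list("1", c("brown", "colour", "dog")))

test_df <- align_terms(test_mat, train_terms)
print(test_df)
```

The result has exactly the training columns, so it can be passed as `newdata`. Words the model never saw, such as "brown", simply carry no information.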

@seymakalay

commented Feb 6, 2019

Hello, thank you very much for sharing this, but I believe
predict(fit, newdata = train) should be run on the test set rather than the training set, as this link suggests: https://cran.r-project.org/web/packages/caret/vignettes/caret.html
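For reference, measuring accuracy on held-out data could be sketched as below. Everything here is made up for illustration: the ten toy documents, the 7/3 split, and the stand-in rule used in place of a fitted model. With a real model, the `pred` line would be replaced by something like predict(fit, newdata = test_set).

```r
set.seed(42)  # reproducible split

# Toy labelled data (made up for illustration): 10 documents, one term column.
docs <- data.frame(
  eat = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 0),
  y = factor(c(1, 1, 0, 0, 1, 0, 1, 0, 1, 0))
)

# Hold out 3 of the 10 rows for testing; train on the rest.
test_idx <- sample(nrow(docs), size = 3)
train_set <- docs[-test_idx, ]
test_set <- docs[test_idx, ]

# Stand-in for a fitted model: predict 1 when the term is present.
pred <- factor(as.integer(test_set$eat > 0), levels = levels(docs$y))

# Accuracy is the share of held-out documents labelled correctly.
accuracy <- mean(pred == test_set$y)
print(accuracy)
```

Predicting on the training rows, as the original script does, only confirms that the model can reproduce labels it has already seen.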
