---
title: "Text Mining with R"
date: "13 February 2018"
output:
  html_document:
    theme: flatly
    highlight: haddock
    code_folding: show
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: false
---
<style>
#TOC {
  background: url("images/bdd-logo-square.png");
  background-size: 35%;
  background-position: center top;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>
```{r setup, include=FALSE}
# Global knitr chunk options (caching is disabled)
knitr::opts_chunk$set(cache=FALSE)
```
# Introduction
* We will continue with the school as the unit of analysis, build on the term-frequency matrices we looked at last week, and introduce two machine learning techniques:
    - **Supervised machine learning** works by "training" a model on a set of "labeled" documents.
    - **Clustering** is an unsupervised technique for uncovering groups in the dataset.
## Supervised machine learning
* Most machine learning algorithms require an input set of features
* We need to transform the school guides into a matrix of schools and features
* Term-frequency matrices are one example of such features (a toy example follows this list)
* While machine learning algorithms can be very accurate classifiers, they tend to have low explanatory power.
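As a reminder from last week, a term-frequency matrix simply counts how often each term occurs in each document. A toy illustration of the shape of such a feature matrix (made-up counts, not from the school guides):
```{r}
# Made-up example: rows are schools (documents), columns are terms,
# and each cell is a term frequency
toy_tf <- matrix(c(3, 0, 1,
                   0, 2, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("school_A", "school_B"),
                                 c("muziek", "computer", "reken")))
toy_tf
```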
### Training
Load the full(-ish) pre-computed normalized term matrix from last week:
```{r}
ndtm <- readRDS("../../../duo/schoolgids2017v2_ndtm_v2.rds")
schools <- read.csv("../../../duo/schools.csv", sep=";", stringsAsFactors = FALSE)
ndtm_meta <- merge(data.frame(VESTIGINGSNUMMER = rownames(ndtm)), schools, all.x = TRUE)
```
The `ndtm` matrix contains a normalized term-frequency matrix for all the schools in our corpus.
```{r}
ndtm[1:10, 1:20]
```
* We will split our sample into training and test sets:
```{r}
training <- sample(nrow(ndtm), size = 500)
predictors <- ndtm[training, ]
response <- as.factor(ndtm_meta$DENOMINATIE[training])
library(e1071)
model <- svm(x = predictors, y = response)
```
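* As a quick check (not part of the original walkthrough), `summary(model)` reports the kernel, the cost parameter, and the number of support vectors:
```{r}
summary(model)
```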
* Now we can use this model to predict the denomination of the schools in the test set:
```{r}
predicted <- predict(model, ndtm[-training, ])
table(predicted)
```
* Let's see how our model did:
```{r}
correct <- predicted == ndtm_meta$DENOMINATIE[-training]
table(correct)
```
Let's look at how we did by category:
```{r}
table(ndtm_meta$DENOMINATIE[-training], correct)
```
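To summarize performance in a single number, we can also compute the overall accuracy, and a full confusion matrix shows which categories get mixed up (a small addition to the steps above):
```{r}
# Overall accuracy on the test set (NAs, e.g. schools without metadata, are dropped)
mean(correct, na.rm = TRUE)
# Confusion matrix: actual denomination (rows) vs. predicted (columns)
table(actual = ndtm_meta$DENOMINATIE[-training], predicted = predicted)
```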
## Tokenizing by n-gram
* In n-gram tokenization, we tokenize adjacent words together (in pairs, triples, and so on) rather than as individual words.
* In some cases, bigrams (or trigrams or 4-grams) can provide better features than individual words.
## Analyzing bigrams
Loading the necessary packages:
```{r}
library(tm)
library(SnowballC)
```
* For 3-grams, 4-grams, *...*, *n*-grams, change the integer `n` argument of the `ngrams()` function (see the trigram sketch after the example below).
* When `n` is set to 2, we are examining pairs of two consecutive words, often called "bigrams".
To do this, we can create our own tokenizer function:
```{r}
BigramTokenizer <- function(text) {
  # To lowercase
  text <- tolower(text)
  # Remove web addresses
  text <- gsub("https?://\\S+", " ", text)
  text <- gsub("www\\.\\S+", " ", text)
  # Remove email addresses
  text <- gsub("\\S+@\\S+", " ", text)
  # Remove all non-alpha characters and collapse whitespace
  text <- gsub("[^[:alpha:]]+", " ", text)
  # Split into words, pair adjacent words, and paste each pair back together
  unlist(lapply(ngrams(words(text), 2), paste, collapse = " "), use.names = FALSE)
}
```
We can test the function on some text input:
```{r}
BigramTokenizer("Op de openbare basisscholen van Proloog krijgen kinderen alle ruimte
om zichzelf te ontwikkelen. Om te worden wie ze zijn. Want daar gaat het
om: jezelf zijn, je talenten ontwikkelen, kennis en vaardigheden opdoen,
plezier maken en gelukkig zijn. Wij willen dat onze leerlingen goed
onderwijs krijgen én een fantastische basisschooltijd hebben. En daar
zetten onze leraren zich optimaal voor in. Zo wordt elke leerling wie het is.
www.proloog.nl")
```
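As noted above, the same function can be adapted to longer n-grams simply by changing the second argument of `ngrams()`. A minimal trigram sketch (with the cleaning steps abbreviated):
```{r}
TrigramTokenizer <- function(text) {
  # Abbreviated cleaning: lowercase and keep only alphabetic characters
  text <- tolower(text)
  text <- gsub("[^[:alpha:]]+", " ", text)
  # Form triples of adjacent words instead of pairs
  unlist(lapply(ngrams(words(text), 3), paste, collapse = " "), use.names = FALSE)
}
TrigramTokenizer("jezelf zijn, je talenten ontwikkelen, kennis en vaardigheden opdoen")
```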
Load the corpus of school guides from last week:
```{r}
corpus <- readRDS("../../../duo/schoolgids2017v2_500.rds")
```
Term frequencies of bigrams:
```{r}
bimatrix <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer,
                                                      language = "nl",
                                                      stemming = TRUE,
                                                      weighting = weightTf))
```
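To get a feel for the result, we can list a few bigrams that occur frequently across the corpus (the threshold of 100 is an arbitrary choice for illustration):
```{r}
# First few bigrams whose total frequency across all guides is at least 100
head(findFreqTerms(bimatrix, lowfreq = 100), 20)
```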
## Cluster Analysis
With the term matrices in hand, we can move on to cluster analysis.
* *Cluster analysis* is a data-reduction technique designed to uncover subgroups of observations within a dataset.
* A *cluster* is defined as a group of observations that are more similar to each other than they are to the observations in other groups.
* This isn't a precise definition, and that fact has led to an enormous variety of clustering methods.
The aim of cluster analysis is to identify groups of observations so that within a group the observations are most similar to each other, whilst between groups the observations are most dissimilar to each other. Cluster analysis is essentially an unsupervised method.
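To make "similar" concrete: most clustering methods start from a distance (dissimilarity) measure between observations. A toy illustration with made-up points:
```{r}
# Three made-up observations with two variables each. Points 1 and 2 are
# close to each other and far from point 3, so a clustering method would
# put 1 and 2 in the same group.
toy <- rbind(c(1, 2), c(1.5, 1.8), c(8, 9))
dist(toy)
```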
## k-means clustering
* It is a partitioning approach in which observations are randomly divided into *K* groups and then reshuffled to form cohesive clusters.
* K-means clustering can handle larger datasets than hierarchical clustering approaches.
* Select variables that you feel may be important for identifying and understanding differences among groups of observations within the data.
* We are hand-selecting the variables on which to cluster; PCA or another dimension-reduction technique could be used instead.
* In order to obtain a final cluster solution, you must decide how many clusters are present in the data (specify *K*, the number of clusters sought).
* Since k-means cluster analysis starts with *k* randomly chosen centroids, use `set.seed()` to guarantee that the results are reproducible.
As k-means clustering requires that you specify in advance the number of clusters to extract, a plot of the total within-groups sums of squares (WSS) against the number of clusters in a k-means solution can be helpful. It can suggest the appropriate number of clusters:
```{r}
# Start with our ndtm matrix, but choose a few simple words
comp <- as.data.frame(ndtm[, c("computer", "kunst", "muziek", "reken", "geschiedenis")])
head(comp)
```
```{r}
library(ggplot2)
ggplot(comp, aes(x = computer, y = muziek)) + geom_point() + xlim(0, 40) + ylim(0, 40)
```
```{r}
library(ggplot2)
library(scales)
library(RColorBrewer)
ggplot(comp, aes(x = kunst, y = muziek)) + geom_point() + xlim(0, 40) + ylim(0, 40)
```
```{r}
set.seed(1001)
# Determine number of clusters
wss <- (nrow(comp) - 1) * sum(apply(comp, 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(comp, centers = i)$withinss)
}
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```
* The sharp decrease in WSS over the first few clusters, followed by a leveling off, suggests a three-cluster solution.
* Alternatively, you can use the `wssplot()` function from the **rattle** package.
* The `nstart` argument of the `kmeans()` function generates multiple initial configurations and reports the best one.
```{r}
set.seed(1001)
k <- kmeans(comp, 3, nstart=25, iter.max=1000)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$clust, pch=16)
```
Examining the model:
```{r}
# The sizes of the clusters
k$size
# The cluster centers (means):
k$centers
```
You can continue examining it with other options:
```{r, eval = FALSE}
# The total sum of squares
k$totss
# Vector of within-cluster sum of squares, one component per cluster.
k$withinss
# Total within-cluster sum of squares, i.e. sum(withinss).
k$tot.withinss
# The between-cluster sum of squares, i.e. totss-tot.withinss.
k$betweenss
```
Cluster sizes:
```{r}
sort(table(k$clust))
clust <- names(sort(table(k$clust)))
```
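To help interpret the clusters, we can cross-tabulate cluster membership with school metadata such as denomination (this assumes, as in the training step above, that the rows of `ndtm_meta` line up with the rows of `ndtm`):
```{r}
# Cluster membership vs. denomination (row order assumed to match ndtm)
table(cluster = k$cluster, denominatie = ndtm_meta$DENOMINATIE)
```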
------------------------------------------------------------------------------
# Data visualization with ggplot2
```{r, include=FALSE}
titanic <- read.csv("titanic.csv", sep = ",", stringsAsFactors = TRUE)
```
We will use the well-known "Titanic" passenger dataset:
```{r}
head(titanic)
```
## Creating a ggplot
* The `ggplot2` package implements a system for creating graphics in R
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare))
```
* Each geom function in **ggplot2** takes a mapping argument. This defines how variables in your dataset are mapped to visual properties.
* The mapping argument is always paired with `aes()`, and the x and y arguments of `aes()` specify which variables to map to the x and y axes.
* In `ggplot2`, plots are created by chaining together functions using the plus (`+`) sign.
## Aesthetic mappings
* The options in the `aes()` function specify what role each variable will play.
    - aes stands for *aesthetics*, or how information is represented visually.
* You can add a third variable, like `Sex`, to the two-dimensional scatterplot above by mapping it to an aesthetic.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare, color = Sex))
```
* You can also use `size` as an aesthetic (however, mapping an unordered variable like `Sex` to size is not a good idea).
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare, size = Sex))
```
* Shape may be as distinctive as color.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare, shape = Sex))
```
* Or you can set the color for the whole plot:
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare), color = "blue")
```
**Question:** What is wrong with the following plot?
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare, color = "blue"))
```
Setting aesthetic properties outside of `aes()` only changes the appearance of the plot; it does not convey information about a variable. Inside `aes()`, the value `"blue"` is treated as a categorical variable with a single level, so the points are mapped to the default color for that level rather than actually drawn in blue.
## Facets
* Faceting splits your plot into subplots that each display one subset of the data.
* It is particularly useful for categorical variables.
* Use the `facet_wrap()` function; its first argument is a formula, created with the `~` (tilde) symbol.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare, color = Sex)) +
facet_wrap( ~ Pclass, nrow = 3)
```
* To facet your plot on the combination of two variables, add `facet_grid()` to your plot call. Its formula should contain two variable names separated by a `~`.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare, color = Sex)) +
facet_grid(Survived ~ Pclass)
```
## Geometric objects
* A **geom** is the geometrical object that a plot uses to represent data. We describe plots by the type of geom that the plot uses.
* Bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
* Every geom function in **ggplot2** takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn't set the "shape" of a line.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_smooth(mapping = aes(x = Age, y = Fare, color = Sex))
```
* To display multiple geoms in the same plot, add multiple geom functions to `ggplot()`.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_point(mapping = aes(x = Age, y = Fare)) +
geom_smooth(mapping = aes(x = Age, y = Fare))
```
* If you pass mappings to `ggplot()` itself, **ggplot2** will treat them as global mappings that apply to each geom in the graph. Producing the plot this way avoids duplication.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic, mapping = aes(x = Age, y = Fare)) +
geom_point() +
geom_smooth()
```
* If you place mappings in a geom function, **ggplot2** will treat them as local mappings for the layer. This makes it possible to display different aesthetics in different layers.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic, mapping = aes(x = Age, y = Fare)) +
geom_point(mapping = aes(color = Sex)) +
geom_smooth()
```
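* In the same way, you can supply different `data` for a specific layer. For example (a sketch; it assumes `Sex` is coded as `"female"`/`"male"`), the smooth line below is computed for female passengers only:
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic, mapping = aes(x = Age, y = Fare)) +
  geom_point(mapping = aes(color = Sex)) +
  # Local data overrides the global data for this layer only
  geom_smooth(data = subset(titanic, Sex == "female"), se = FALSE)
```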
## Statistical transformations
### Bar charts
* By default, `geom_bar()` uses `stat = "count"`: the height of each bar is the number of rows in each category.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_bar(mapping = aes(x = Pclass), stat = "count")
```
* You can map a variable to the y-axis instead of the default count. For this you have to change the **stat** argument (see the sketch after the next plot).
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_bar(mapping = aes(x = Pclass, y = Fare), stat = "identity")
```
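* With `stat = "identity"`, bar heights are taken directly from the data, so it usually makes sense to summarize first. A sketch that aggregates the mean fare per class before plotting:
```{r message=FALSE, warning=FALSE}
# Summarize first, then let the bar height equal the value in the data
mean_fare <- aggregate(Fare ~ Pclass, data = titanic, FUN = mean)
ggplot(data = mean_fare) +
  geom_bar(mapping = aes(x = Pclass, y = Fare), stat = "identity")
```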
## Position adjustments
* You can color a bar chart using either the `color` aesthetic, or more usefully, `fill`.
* Functions such as `labs()`, `xlab()`, `ylab()`, and `ggtitle()` are optional and add *annotations* (axis labels and a title); a combined `labs()` variant follows the next plot.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_bar(mapping = aes(x = as.factor(Survived), fill = Sex), stat = "count") +
xlab("Survived") +
ylab("Passengers") +
ggtitle("Survival by Sex")
```
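* The same annotations can be combined into a single `labs()` call:
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
  geom_bar(mapping = aes(x = as.factor(Survived), fill = Sex), stat = "count") +
  labs(x = "Survived", y = "Passengers", title = "Survival by Sex")
```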
* `position = "fill"` works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_bar(mapping = aes(x = as.factor(Survived), fill = Sex), stat = "count", position = "fill")
```
* `position = "dodge"` places overlapping objects directly *beside* one another. This makes it easier to compare individual values.
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
geom_bar(mapping = aes(x = as.factor(Survived), fill = Sex), stat = "count", position = "dodge")
```
## Saving graphs
* You can save graphs with the `ggsave()` function; its options include which plot to save, where to save it, and in what format.
```{r, eval=FALSE}
myplot <- ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(file = "mygraph.png", plot=myplot, width = 5, height = 4)
```
## Summary
```
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
```
* The code template for general plotting is shown above; a worked instance follows below.
* The most essential parameters are the **data**, the **mappings**, and the **geom function**. You rarely need to supply all seven parameters to make a graph.
* Geometric objects (called *geoms* for short) produce the visual output, including points, lines, bars, box plots, and shaded regions. Geoms are added to the graph using one or more *geom functions*.
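As a worked instance of the template (the coordinate and facet functions chosen here are illustrative, not required):
```{r message=FALSE, warning=FALSE}
ggplot(data = titanic) +
  geom_bar(
    mapping = aes(x = as.factor(Survived), fill = Sex),
    stat = "count",
    position = "dodge"
  ) +
  coord_flip() +
  facet_wrap(~ Pclass)
```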