Created
June 20, 2018 11:51
Star
You must be signed in to star a gist
Introductory tutorial to text clustering with R
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Introductory tutorial to text clustering with R" | |
output: | |
html_document: | |
df_print: paged | |
--- | |
```{r Initializing the script} | |
# Loading the packages that will be used | |
list.of.packages <- c("tm", "dbscan", "proxy", "colorspace") | |
# (downloading and) requiring packages | |
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])] | |
if(length(new.packages)) | |
install.packages(new.packages) | |
for (p in list.of.packages) | |
require(p, character.only = TRUE) | |
rm(list = ls()) # Cleaning environment | |
options(header = FALSE, stringsAsFactors = FALSE, fileEncoding = "latin1") | |
``` | |
# 0. The dataset, and the goal to achieve | |
## Opening notes on R | |
As you will notice, variables will contain dot (".") in their name. | |
This might repel some people because the dot uses to be an operator used in structure and/or objects operations. | |
This is not the case in R, and the character is somehow used as a semantic comfort for variable naming. | |
## Opening notes on the problem | |
We are going to cluster a dataset consisting of health news tweets. These short sentences belong to one of the 16 sources of news considered in the dataset. | |
We are then facing a multi-label classifying problem, with `num_classes = 16`. | |
```{r Setting the hyper-parameter} | |
truth.K <- 16 | |
``` | |
# 1. Data acquisition | |
We are about to download directly the data from the *UCI Machine Learning repository*. | |
Thanks to native functions, we are able to download the zip file, extract it and fill a dataframe with all the text files read iteratively. | |
```{r Downloading and reading the data} | |
# Creating the empty dataset with the formatted columns | |
dataframe <- data.frame(ID=character(), | |
datetime=character(), | |
content=character(), | |
label=factor()) | |
source.url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip' | |
target.directory <- '/tmp/clustering-r' | |
temporary.file <- tempfile() | |
download.file(source.url, temporary.file) | |
unzip(temporary.file, exdir = target.directory) | |
# Reading the files | |
target.directory <- paste(target.directory, 'Health-Tweets', sep = '/') | |
files <- list.files(path = target.directory, pattern='.txt$') | |
# Filling the dataframe by reading the text content | |
for (f in files) { | |
news.filename = paste(target.directory , f, sep ='/') | |
news.label <- substr(f, 0, nchar(f) - 4) # Removing the 4 last characters => '.txt' | |
news.data <- read.csv(news.filename, | |
encoding = 'UTF-8', | |
header = FALSE, | |
quote = "", | |
sep = '|', | |
col.names = c('ID', 'datetime', 'content')) | |
# Trick to ignore last part of tweets which content contains the split character "|" | |
# No satisfying solution has been found to split (as in Python) and merging extra-columns with the last one | |
news.data <- news.data[news.data$content != "", ] | |
news.data['label'] = news.label # We add the label of the tweet | |
# Only considering a little portion of data ... | |
# ... because handling sparse matrix for generic usage is a pain | |
news.data <- head(news.data, floor(nrow(news.data) * 0.05)) | |
dataframe <- rbind(dataframe, news.data) | |
} | |
# Deleting the temporary directory | |
unlink(target.directory, recursive = TRUE) | |
``` | |
As you may have already seen, some problems have been trickily avoided. | |
First, as far as I know, no easy method is possible to merge extra columns (for extra separator in the text) with the last. The choice has been to ignore these news lines (oddly formatted then) and to keep the truncated data for the associated beginning line. | |
Second, a tiny fraction of the data is kept. Indeed, as we will be using TF-IDF for sentence representation, the matrices will be sparse so huge. It obviously exists libraries to handle big data and/or sparse matrices, but here where R fails a lot compared to Python in my opinion : there is low compatibility between these external structures and processing packages (clustering, classifiers, ...) | |
# 2. Preprocessing the data | |
## Manual preprocessing | |
The only manual preprocessing that we will do is to remove the URLs in the tweets. They may help to cluster if a specific "url shortener" is used by a health news source but it is not a very common text resource to handle in NLP and moreover is a kind of cheat in the clustering process. | |
```{r Manual preprocessing} | |
sentences <- sub("http://([[:alnum:]|[:punct:]])+", '', dataframe$content) | |
``` | |
## Automatic preprocessing | |
For common preprocessing problems, we are going to use a package that facilitates a lot the tasks. | |
This help is appreciated because R is *not* the perfect tool to process and computes on characters (a lot harder compared to Python) | |
This library is *tm* (*T*ext *M*ining) | |
```{r Automatic preprocessing} | |
corpus = tm::Corpus(tm::VectorSource(sentences)) | |
# Cleaning up | |
# Handling UTF-8 encoding problem from the dataset | |
corpus.cleaned <- tm::tm_map(corpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte')) | |
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::removeWords, tm::stopwords('english')) # Removing stop-words | |
corpus.cleaned <- tm::tm_map(corpus, tm::stemDocument, language = "english") # Stemming the words | |
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stripWhitespace) # Trimming excessive whitespaces | |
``` | |
# 3. Text representation | |
Now, we have a sequence of cleaned sentences that we can use to build our TF-IDF matrix. | |
From this result, we will be able to execute every numerical processes that we want, such as clustering. | |
```{r Representation and distance matrices} | |
# Building the feature matrices | |
tdm <- tm::DocumentTermMatrix(corpus.cleaned) | |
tdm.tfidf <- tm::weightTfIdf(tdm) | |
# We remove A LOT of features. R is natively very weak with high dimensional matrix | |
tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf, 0.999) | |
# There is the memory-problem part | |
# - Native matrix isn't "sparse-compliant" in the memory | |
# - Sparse implementations aren't necessary compatible with clustering algorithms | |
tfidf.matrix <- as.matrix(tdm.tfidf) | |
# Cosine distance matrix (useful for specific clustering algorithms) | |
dist.matrix = proxy::dist(tfidf.matrix, method = "cosine") | |
``` | |
# 4. Running the clustering algorithms | |
## Partitioning clustering | |
As a partitioning clustering, we will use the famous K-means algorithm. As we know the dataset, we can define properly the number of awaited clusters | |
```{r Partitioning clustering} | |
clustering.kmeans <- kmeans(tfidf.matrix, truth.K) | |
``` | |
## Hierarchical clustering | |
R comes with an easy interface to run hierarchical clustering. All we have to define is the clustering criterion and the pointwise distance matrix. | |
We will be using the Ward's method as the clustering criterion. | |
```{r Hierarchical clustering} | |
clustering.hierarchical <- hclust(dist.matrix, method = "ward.D2") | |
``` | |
## Density-based clustering | |
To try the density-based clustering, we will run the HDBScan algorithm. We can run it easily from an external package, *dbscan*. | |
Regarding the hyper-parameters of the algorithm, a more or less arbitrary value has been fixed. | |
```{r Density-based clustering} | |
clustering.dbscan <- dbscan::hdbscan(dist.matrix, minPts = 10) | |
``` | |
## Stacking clustering | |
As a final clustering, we will use a hard-voting strategy to merge the results between the 3 previous clustering. | |
It goes like this : | |
- we define a *master* clustering, all the other are *slave* clusterings. There, we chose arbitrarily the K-means clustering as the *master* clustering | |
```{r Stacking clustering initialization} | |
master.cluster <- clustering.kmeans$cluster | |
slave.hierarchical <- cutree(clustering.hierarchical, k = truth.K) | |
slave.dbscan <- clustering.dbscan$cluster | |
# Preparing the stacked clustering | |
stacked.clustering <- rep(NA, length(master.cluster)) | |
names(stacked.clustering) <- 1:length(master.cluster) | |
``` | |
- Then, for each cluster label in the *master* clustering, we recursively hard vote the cluster on each slave, meaning that we define the corresponding stacked clustering as the most found label on the first cluster, then the most found label of the previous found label in the second cluster. | |
- This may result in big clusters, which is a risk of an hard-vote stacking clustering. | |
- It is of course not the perfect solution to stack clusterings, but an easy one to implement in R | |
```{r Stacking clustering execution} | |
for (cluster in unique(master.cluster)) { | |
indexes = which(master.cluster == cluster, arr.ind = TRUE) | |
slave1.votes <- table(slave.hierarchical[indexes]) | |
slave1.maxcount <- names(slave1.votes)[which.max(slave1.votes)] | |
slave1.indexes = which(slave.hierarchical == slave1.maxcount, arr.ind = TRUE) | |
slave2.votes <- table(slave.dbscan[indexes]) | |
slave2.maxcount <- names(slave2.votes)[which.max(slave2.votes)] | |
stacked.clustering[indexes] <- slave2.maxcount | |
} | |
``` | |
# 5. Plotting the results | |
Plotting is one of the best features of R in my opinion. Compared to Python, I find R more convenient, comfortable and easier to create, store and save plots of every kind. | |
The native solution is enough complete to cover most of the basics features of a plot. | |
For more aesthetic and complex requirements, it exists the famous *ggplot2* package. It is widely used but a lot less accessible in its usage interface. | |
To plot our clustering, as our feature spaces is highly dimensional (TF-IDF representation), we will reduce it to 2 thanks to the multi-dimensional scaling. | |
This technique is dependent of our distance metric, but in our case with TF-IDF, it is highly preferable than the famous PCA technique. | |
```{r Plotting} | |
points <- cmdscale(dist.matrix, k = 2) # Running the PCA | |
palette <- colorspace::diverge_hcl(truth.K) # Creating a color palette | |
previous.par <- par(mfrow=c(2,2), mar = rep(1.5, 4)) # partitionning the plot space | |
plot(points, | |
main = 'K-Means clustering', | |
col = as.factor(master.cluster), | |
mai = c(0, 0, 0, 0), | |
mar = c(0, 0, 0, 0), | |
xaxt = 'n', yaxt = 'n', | |
xlab = '', ylab = '') | |
plot(points, | |
main = 'Hierarchical clustering', | |
col = as.factor(slave.hierarchical), | |
mai = c(0, 0, 0, 0), | |
mar = c(0, 0, 0, 0), | |
xaxt = 'n', yaxt = 'n', | |
xlab = '', ylab = '') | |
plot(points, | |
main = 'Density-based clustering', | |
col = as.factor(slave.dbscan), | |
mai = c(0, 0, 0, 0), | |
mar = c(0, 0, 0, 0), | |
xaxt = 'n', yaxt = 'n', | |
xlab = '', ylab = '') | |
plot(points, | |
main = 'Stacked clustering', | |
col = as.factor(stacked.clustering), | |
mai = c(0, 0, 0, 0), | |
mar = c(0, 0, 0, 0), | |
xaxt = 'n', yaxt = 'n', | |
xlab = '', ylab = '') | |
par(previous.par) # recovering the original plot space parameters | |
``` |
How do you identify the words in the clusters. Did you ever get an answer to the question?
Hello,
Sorry for the late answer, I didn't get any notification about these comments.
I will try to give an helpful answer, as I didn't really use R a lot since I wrote that notebook
By looking at the code again, I would propose the following solution :
- start from the list of clusters based on the chosen clustering (e.g.
master.cluster
orstacked.clustering
) - iterate over the different clusters labels, i.e.
unique(your_chosen_clustering)
- get the relevant indexes of the points belonging to that cluster :
cluster_indexes = which(your_chosen_clustering == cluster, arr.ind = TRUE)
- Try this piece of code to get relevant information :
cluster_inspection = inspect(tdm.tfidf[cluster_indexes]
This is not tested, as I don't have an R environment to try out. The cluster_inspection
variable should contain a pretty printing of the information that you want.
Nice explanation! Just one remark: why have you called a DocumentTermMatrix tdm? This could be confused with TermDocumentMatrix in my opinion.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Thank you so much for sharinf this to the world. But I still can´t figure out a way of getting the text data from the corpus file, meaning that I want to access the words the clusters are made of and the plain text vector after cleaning it and so, could please tell me, if you know?