@scjs scjs/classification.Rmd
Last active Jan 21, 2019
---
title: "Classification model for Yerevan Armenian plosives"
output: html_document
---
This document fits a multinomial classification model to the Yerevan Armenian
plosives data. Voicing category (voiced, voiceless, or aspirated) is predicted
from a set of acoustic measurements. The goals are to test whether the three-way
voicing contrast can be discriminated above chance in both word-initial and
word-final position, and to evaluate which acoustic predictors contribute to
classification.

# Model setup
```{r}
library("caret")
library("forcats")
library("nnet")
library("rlang")
library("stringr")
library("tidyverse")
library("xtable")
load("cleaned.Rdata")
```
List the variables that will be used in the model, along with formatted names
for use in tables.
```{r}
model_vars <- c(
  "H1*-H2*" = "vowel_H1H2c",
  "Cepstral peak prominence" = "vowel_CPP",
  "Log SoE during closure" = "log_closure_soe",
  "Voice onset/offset time" = "voice_time",
  "Aspiration duration" = "aspiration_duration",
  "Closure duration" = "closure_duration",
  "Vowel duration" = "vowel_duration",
  "Pitch" = "vowel_strF0"
)
```
Prepare the data for the model:
1. Create the log-transformed SoE variable, and combine word-initial voice onset
and word-final voice offset times into one column.
2. Remove observations with any missing measurements.
3. Center all of the variables within-speaker.
4. Convert `voicing` into a factor with a reference level of `voiceless`.
```{r}
model_data <- tokens %>%
  mutate(
    log_closure_soe = log(closure_soe + 0.001),
    voice_time = coalesce(VOT, VOFT)
  ) %>%
  filter_at(vars(model_vars), all_vars(!is.na(.))) %>%
  group_by(speaker) %>%
  mutate_at(vars(model_vars), scale, scale = FALSE) %>%
  ungroup() %>%
  mutate(
    voicing = factor(voicing, levels = c("voiceless", "aspirated", "voiced"))
  )
```
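As a quick illustration of step 1, `coalesce()` takes the first non-missing
value elementwise, which is how the word-initial `VOT` and word-final `VOFT`
columns are merged into one. A standalone sketch with made-up values (not from
the dataset):

```{r}
library(dplyr)

# each token has either a VOT (word-initial) or a VOFT (word-final), never both
VOT <- c(10, NA, 15)
VOFT <- c(NA, 22, NA)
coalesce(VOT, VOFT)
# [1] 10 22 15
```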
Define the model formula, and fit a model to all of the data. The class to be
predicted is in `voicing`, and there are predictors for all of the variables
listed above, plus their interactions with `position` (word-initial or
word-final).
```{r}
predictors <- paste(
  c(model_vars, paste(model_vars, "position", sep = ":")),
  collapse = " + "
)

model_formula <- reformulate(
  response = "voicing",
  termlabels = predictors
)
```
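For reference, `reformulate()` assembles a formula object from character
vectors. A minimal standalone example with hypothetical variable names:

```{r}
reformulate(termlabels = c("x1", "x2", "x1:position"), response = "y")
# y ~ x1 + x2 + x1:position
```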
```{r}
mod_multinomial <- multinom(formula = model_formula, data = model_data)
```

# Make predictions
Define a function to create a balanced sample of the data, so that each class
contributes an equal number of observations. The row(s) indexed by `heldout`
are excluded from the sample.
```{r}
get_balanced_training_sample <- function(data, response, heldout) {
  grouped_by_response <- group_by(data[-heldout, ], !!response)
  smallest_group_size <- min(count(grouped_by_response)[["n"]])
  sample_n(grouped_by_response, smallest_group_size)
}
```
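To see what the balancing does, here is a standalone sketch on made-up toy data
(not the plosives dataset): after the held-out row is removed, each class is
downsampled to the size of the smallest remaining class.

```{r}
library(dplyr)

toy <- tibble(class = c("a", "a", "a", "a", "b", "b"), x = 1:6)
# hold out row 1, then sample each class down to the smallest group size
remaining <- toy[-1, ]
smallest <- min(count(remaining, class)[["n"]])
balanced <- remaining %>%
  group_by(class) %>%
  sample_n(smallest) %>%
  ungroup()
table(balanced$class)
# both classes now contribute 2 rows
```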
Define a function which gets a training sample, refits the model to it, and then
makes a prediction for the held-out row(s).
```{r}
get_heldout_prediction <- function(model, data, heldout, ...) {
  training <- get_balanced_training_sample(data, terms(model)[[2]], heldout)
  fit <- update(model, data = training, ...)
  predict(fit, newdata = data[heldout, ])
}
```
Use `get_heldout_prediction()` to make predictions for each row.
```{r}
model_data[["prediction"]] <- sapply(
  X = 1:nrow(model_data),
  FUN = get_heldout_prediction,
  model = mod_multinomial,
  data = model_data,
  trace = FALSE
)
```

# Evaluate accuracy
What percentage of predictions were correct, for different subsets of the data?
```{r}
mean(model_data$prediction == model_data$voicing)

group_by(model_data, position) %>%
  summarize(accuracy = mean(prediction == voicing))

group_by(model_data, voicing) %>%
  summarize(accuracy = mean(prediction == voicing))

group_by(model_data, position, voicing) %>%
  summarize(accuracy = mean(prediction == voicing))

group_by(model_data, position, context) %>%
  summarize(accuracy = mean(prediction == voicing))

group_by(model_data, position, context, voicing) %>%
  summarize(accuracy = mean(prediction == voicing))

group_by(model_data, transliteration, word, position, voicing) %>%
  summarize(accuracy = mean(prediction == voicing)) %>%
  arrange(accuracy)
```
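Since the goal is discrimination above chance, it helps to keep a baseline in
mind: with balanced training samples, chance accuracy for a three-way contrast
is 1/3, and a majority-class guesser on imbalanced data would score the largest
class proportion. A standalone sketch with made-up counts:

```{r}
labels <- factor(rep(c("voiceless", "aspirated", "voiced"), times = c(50, 40, 30)))
# proportion achieved by always guessing the most frequent class
max(table(labels)) / length(labels)
# [1] 0.4166667
```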
Examine which categories were confused for which other categories in each
position. For word-final position, also check the differences between the two
carrier phrases.
```{r}
with(
  model_data,
  confusionMatrix(prediction, voicing)
)

with(
  filter(model_data, position == "initial"),
  confusionMatrix(prediction, voicing)
)

with(
  filter(model_data, position == "final"),
  confusionMatrix(prediction, voicing)
)

with(
  filter(model_data, position == "final", context == "before voiced"),
  confusionMatrix(prediction, voicing)
)

with(
  filter(model_data, position == "final", context == "before voiceless"),
  confusionMatrix(prediction, voicing)
)
```

# Evaluate predictors
Which predictors uniquely contribute to classification accuracy? Predictors are
evaluated by nested model comparison.
Define a function which fits a reduced model that omits any terms which match a
string.
```{r}
fit_reduced_model <- function(model, predictor, ...) {
  term_labels <- attr(terms(model), "term.labels")
  # this also finds interaction terms which include the predictor;
  # be careful about other partial string matches that are unintended
  to_drop <- str_which(term_labels, predictor)
  update(model, terms(model)[-to_drop], ...)
}
```
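The partial-matching caveat in the comment can be seen directly: `str_which()`
treats the predictor name as a regular expression, so a short string matches
every term label containing it. A standalone sketch with toy term labels:

```{r}
library(stringr)

terms <- c("closure_duration", "vowel_duration", "closure_duration:position")
str_which(terms, "closure_duration")
# [1] 1 3
str_which(terms, "duration")
# [1] 1 2 3
```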
Define a function to calculate a likelihood ratio test for a nested model
comparison. The test statistic is the delta deviance `ddev` between the two
models, which is chi-square distributed with degrees of freedom `ddf` equal to
the delta degrees of freedom between the two models.
```{r}
likelihood_ratio <- function(x, y) {
  llx <- logLik(x)
  lly <- logLik(y)
  ddev <- abs(as.numeric(-2 * (llx - lly)))
  ddf <- attr(llx, "df") - attr(lly, "df")
  stopifnot(llx >= lly & ddf >= 0)
  list(ddev = ddev, ddf = ddf, pchi = pchisq(ddev, ddf, lower.tail = FALSE))
}
```
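For intuition, a standalone numeric example with made-up values: if dropping
four parameters lowers the log-likelihood by 10 units, the deviance difference
is 20, which is well into the tail of a chi-square distribution with 4 degrees
of freedom.

```{r}
ddev <- 2 * 10
pchisq(ddev, df = 4, lower.tail = FALSE)
# about 0.0005
```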
Define a function which fits a nested model and returns the BIC difference and
a likelihood ratio test for the two models.
```{r}
drop_one_predictor <- function(model, predictor, ...) {
  reduced_model <- fit_reduced_model(model, predictor, ...)
  ll <- likelihood_ratio(model, reduced_model)
  tibble(
    omitted_predictor = predictor,
    dBIC = BIC(reduced_model) - BIC(model),
    BIC = BIC(reduced_model),
    ddev = ll$ddev,
    ddf = ll$ddf,
    pchi = ll$pchi
  )
}
```
Evaluate all nested models where each model has one predictor and its
interaction with word position omitted, and print the results to a LaTeX table
sorted by BIC difference.
```{r, comment = NA}
model_vars %>%
  map_df(~ drop_one_predictor(mod_multinomial, ., trace = FALSE)) %>%
  arrange(desc(dBIC)) %>%
  mutate(
    omitted_predictor = fct_recode(omitted_predictor, !!!model_vars),
    ddf = as_integer(ddf)
  ) %>%
  xtable(
    digits = c(0, 1, 1, 1, 2, 1, 3),
    caption = "Model comparison statistics for omitted predictors."
  ) %>%
  print(
    booktabs = TRUE,
    include.rownames = FALSE,
    table.placement = "h!"
  )
```
Evaluate all nested models where each model has one interaction with word
position omitted, and print the results to a LaTeX table sorted by BIC
difference.
```{r, comment = NA}
paste(model_vars, "position", sep = ":") %>%
  map_df(~ drop_one_predictor(mod_multinomial, ., trace = FALSE)) %>%
  arrange(desc(dBIC)) %>%
  mutate(
    omitted_predictor = str_remove(omitted_predictor, ":position"),
    omitted_predictor = fct_recode(omitted_predictor, !!!model_vars),
    omitted_predictor = paste(omitted_predictor, "Position", sep = " x "),
    ddf = as_integer(ddf)
  ) %>%
  xtable(
    digits = c(0, 1, 1, 1, 2, 1, 3),
    caption = "Model comparison statistics for omitted interactions."
  ) %>%
  print(
    booktabs = TRUE,
    include.rownames = FALSE,
    table.placement = "h!"
  )
```
```{r}
devtools::session_info()
```