@daroczig
Last active December 19, 2015 13:49
Analysing the results of The Cambridge Online Survey of World Englishes in the United Kingdom. See related blogpost @ http://blog.rapporter.net/2013/07/uk-dialect-maps.html
<!--head
meta:
  title: UK language usage
  description: Analysing the results of The Cambridge Online Survey of World Englishes
    in the United Kingdom
  author: ' (@daroczig)'
  packages:
  - class
  - descr
  - dismo
  - raster
  - RColorBrewer
  - rgdal
  - scales
  - MASS
inputs:
- required: yes
  class: character
  name: q
  label: Question
  standalone: yes
  value: Pop or soda?
  length:
    min: 1.0
    max: 1.0
  description: Question to analyse
  matchable: yes
  options:
  - Pop or soda?
  - What do you call the long cold sandwich that contains cold cuts, lettuce, and
    so on?
  - What is your generic casual or informal term for a sweetened carbonated beverage?
  - What is your general, informal term for the rubber-soled shoes worn in gym class,
    for athletic activities, etc.?
  - What do you call the kind of crustacean that looks like a tiny lobster and lives
    in lakes and streams?
  - What word(s) do you use in casual speech to address a group of two or more people?
  - What do you call the little gray (or black or brown) creature (that looks like
    an insect but is actually a crustacean) that rolls up into a ball when you touch
    it?
  - What do you call the kind of rain that falls while the sun is shining?
  - What do you call the gooey or dry matter that collects in the corners of your
    eyes, especially while you are sleeping?
  - How do you pronounce the vowel sound in the word 'aunt' ("parent's sister")?
  - What is your preferred general and casual term for a sale of your unwanted items
    (which may be held on your porch, in your yard, garden, or house, from the back
    of your car, etc.)?
  - What do you call the wheeled contraption in which you carry groceries at the grocery
    store or supermarket?
  - What do you call a traffic intersection in which several roads meet in a circle
    and you have to get off at a certain point?
  - Do you pronounce r's when they aren't followed by a vowel, as in car, cart, carton,
    and so on?
  - How do you pronounce 'sawing' and 'saw it', as in "I enjoying sawing wood" and
    "she saw it"?
  - How do you pronounce 'Shah of', as in "Abbas was a famous Shah of Iran"?
  - How do you pronounce 'which' and 'witch'?
  - What do you call the meal you eat in the evening, normally somewhere between 5
    and 10 PM?
  - What do you call an upholstered seat for more than one person?
  - What do you a call a store that is devoted primarily to selling alcoholic beverages?
  - What do you call a room equipped with toilets and lavatories for public use?
  - What do you call the auxiliary brake that's attached to a rear wheel or the transmission
    and keeps the car from moving accidentally?
  - What do you call an automobile transmission system in which gears are selected
    by the driver by means of a hand-operated gearshift and a foot-operated clutch?
  - What do you call an artificial nipple, usually made of plastic, which an infant
    can suck or chew on?
  - What do you call food purchased at a restaurant to be eaten elsewhere?
  - What do you call this large aquatic bug that skims along the surface of water?
  - What do you call a narrow street or passageway between or behind buildings?
  - What do you call an unattended machine (normally outside a bank) that dispenses
    money when a personal coded card is used?
  - What do you call your fifth/smallest toe?
  - What do you call this long green herb that is used as a garnish or in soups, salads
    and stir-fry dishes? (It belongs to the genus Allium and lacks a fully-developed
    bulb.)
  - How do you pronounce the last vowel in the word "cinema"?
  - How do you pronounce the last vowel in the word "happy"?
  - How do you pronounce the letter "H"?
  - How do you pronounce the name of this small British quick bread (or cake if the
    recipe includes sugar)?
  - How do you pronounce the past tense of the verb "eat"?
  - How do you pronounce the word "again"?
  - How do you pronounce the word "bald"?
  - How do you pronounce the word "cut"?
  - How do you pronounce the word "last"?
  - How do you pronounce the word "sandwich"?
  - How do you pronounce the word "schedule"?
  - How do you pronounce the word "sixth"?
  - What do you call a a sandwich made with bread or bread roll (usually white and
    buttered) and chips, often with some sort of sauce?
  - What do you call a narrow, pedestrian lane found in urban areas which usually
    runs between or behind buildings?
  - What do you call a rack you dry your clothes on in a house?
  - What do you call a small round piece of bread typically used as a side dish?
  - What do you call a young person in cheap trendy clothes and jewellery?
  - What do you call circular junction in which road traffic must travel in one direction
    around a central island?
  - What do you call item of clothing worn on the lower part of the body from the
    waist to the ankles, covering both legs separately?
  - What do you call short undergarments worn on the lower body?
  - What do you call the creepy crawly thing that often rolls into a ball when touched?
  - What do you call the person who collects and removes rubbish from residential
    areas for further processing and disposal?
  - What do you call the popular sport played between two teams of eleven players
    with a spherical ball?
  - What do you say to call for a temporary respite or truce during a game or activity?
  - What is your general term for sweetened carbonated beverages?
  - What is your general term for the type of rubber-soled shoes that one typically
    wears for athletic activities or casual situations?
  allow_multiple: no
- required: no
  class: integer
  name: k
  label: Number of neighbours to check
  standalone: yes
  value: 3.0
  length:
    min: 1.0
    max: 1.0
  description: Number of neighbours to check in the k-nearest neighbour classification
  limit:
    min: 1.0
    max: 10.0
- required: no
  class: character
  name: colp
  label: Color palette
  standalone: yes
  value: Set1
  length:
    min: 1.0
    max: 1.0
  description: Color palette from colorbrewer.com
  matchable: yes
  options:
  - BrBG
  - PiYG
  - PRGn
  - PuOr
  - RdBu
  - RdGy
  - RdYlBu
  - RdYlGn
  - Spectral
  - Accent
  - Dark2
  - Paired
  - Pastel1
  - Pastel2
  - Set1
  - Set2
  - Set3
  - Blues
  - BuGn
  - BuPu
  - GnBu
  - Greens
  - Greys
  - Oranges
  - OrRd
  - PuBu
  - PuBuGn
  - PuRd
  - Purples
  - RdPu
  - Reds
  - YlGn
  - YlGnBu
  - YlOrBr
  - YlOrRd
  allow_multiple: no
head-->
## Data source
Bert Vaux and Marius L. Jøhndal (University of Cambridge, United Kingdom) have recently published some exciting results from [The Cambridge Online Survey of World Englishes](http://www.tekstlab.uio.no/cambridge_survey/), which we analyse a bit further below.
## Apologetics
Please note that the report below is generated automatically from a [statistical report template](http://support.rapporter.net/entries/22471338-What-is-a-template-): the results, map, tables and all of this text are generated in real time or served from cache. This means that you are now reading a quick report written by computers that has not been proofread.
## Map
First, let us plot the raw results about _<%=q%>_ gathered in the United Kingdom on a terrain map borrowed from [Google](https://developers.google.com/maps/):
<%=
df <- UK_language_data$df
polies <- UK_language_data$polies
poliesM <- UK_language_data$poliesM
bgmap <- UK_language_data$bgmap
smallpolies <- UK_language_data$smallpolies
## load data
#df <- readRDS(system.file('custom-data/UK-survey.RData', package = 'rapport.server'))
## fix levels
levels(df$Q)[2] <- 'What do you call the long cold sandwich that contains cold cuts, lettuce, and so on?'
levels(df$Q)[3] <- 'What is your generic casual or informal term for a sweetened carbonated beverage?'
levels(df$Q) <- gsub('[>|<]', '\'', levels(df$Q))
## filter data
df <- df[which(df$Q == q), ]
df$A <- factor(df$A)
levels(df$A) <- gsub('[>|<]', '\'', levels(df$A))
## order
lt <- as.numeric(table(df$A))
lno <- length(lt)
ln <- min(length(lt), ifelse('other' %in% levels(df$A), 6, 5))
lo <- order(lt, decreasing = TRUE)[1:ln]
lt <- lt[lo]
llo <- names(table(df$A))
ll <- llo[lo]
## drop 5+ cats
if (!'other' %in% ll) {
    ll <- c(ll, 'other')
    ln <- ln + 1
}
df$A <- as.character(df$A)
ids <- which(!df$A %in% ll)
if (length(ids) > 0) {
    df$A[ids] <- 'other'
}
df$A <- factor(df$A, levels = ll)
if (table(df$A)[['other']] == 0) {
    ll <- setdiff(ll, 'other')
    df$A <- factor(df$A, levels = ll)
    ln <- ln - 1
}
## strwrap
ll <- sapply(ll, function(l) paste(strwrap(l, 30), collapse = '\n'))
## colors
if (nrow(df) > 0) {
    cs <- brewer.pal(ln, colp)
    ct <- alpha(cs, 0.4)
    df$cs <- df$ct <- df$A
    levels(df$cs) <- cs
    levels(df$ct) <- ct
}
## map data
#polies <- readRDS(system.file('custom-data/UK-polies.RData', package = 'rapport.server'))
#poliesM <- readRDS(system.file('custom-data/UK-polies-mercator.RData', package = 'rapport.server'))
#bgmap <- readRDS(system.file('custom-data/UK-raster.RData', package = 'rapport.server'))
bgmap@file@name <- "/usr/local/lib/R/site-library/rapport.server/custom-data/UK-raster-raw.gif"
#bgmap@data@names <- system.file('custom-data/UK.raster.raw', package = 'rapport.server')
## cluster
centroids <- coordinates(polies)
require(class)
if (nrow(df) > 0) {
    cols <- knn(df[, c('LNG', 'LAT')], centroids, df$ct, k)
}
## update plot settings
evalsOptions('width', 700)
evalsOptions('height', 700)
#evalsOptions('res', 150)
evalsOptions('graph.unify', FALSE)
%>
<%=
## plot
set.caption(q)
%>
<% if (nrow(df) > 0) { %>
<%=
plot(bgmap, maxpixels = 10e7, xaxs = 'i', yaxs = 'i')
+plot(poliesM, add = TRUE, col = as.character(cols))
+points(Mercator(df[, c(2,1)]) , col = as.character(df$cs), pch = '*', cex = 2)
+if (legend("topright", legend = paste0(ll, ' [', lt, ']'), col = cs, pch = '*', box.col = '#B2B2B2', bg = '#B2B2B2', cex = 1, plot = FALSE)$rect$h > 700000) {
    legend("topright", legend = paste0(ll, ' [', lt, ']'), col = cs, pch = '*', box.col = '#B2B2B2', bg = '#B2B2B2', cex = 1)
} else {
    legend("topright", legend = paste0(ll, ' [', lt, ']'), col = cs, pch = '*', box.col = '#B2B2B2', bg = '#B2B2B2', cex = 1.5)
}
%>
<% } else { %>
<%= plot(bgmap, maxpixels = 10e7, xaxs = 'i', yaxs = 'i') %>
<% } %>
### Responses
The map above shows the raw results, geocoded by the Zip code of the respondents and marked by coloured stars, for the <%=lno%> categories offered in the survey. See the legend in the top right corner for details; the number of cases for each category is shown in square brackets after its label.
<% if (lno > ln) { %>
### Merged categories
Please note that <%=lno-ln%> categories were merged into the "other" category (n=<%=length(which(df$A == 'other'))%>) on the map for convenience:
<%= paste(pandoc.list.return(llo[-lo]), collapse = '\n') %>
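
The merging step can be sketched in a few lines of base R. This is only a simplified illustration on made-up answers: the `answers` vector and the cut-off of two categories are assumptions for the example, not the survey data or the exact rule used for the map.

```r
# Made-up answers; the real report works on the survey responses
answers <- c(rep('pop', 4), rep('soda', 3), 'coke', 'tonic')

# Keep the two most frequent categories (hypothetical cut-off)
counts <- sort(table(answers), decreasing = TRUE)
keep <- names(counts)[1:2]

# Everything else is relabelled as 'other'
merged <- ifelse(answers %in% keep, answers, 'other')
table(merged)
```
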
<% } %>
### K-nearest neighbours
Besides the <%=nrow(df)%> answers, 192 subdivisions of the United Kingdom are also shown in similar (a bit dimmer and transparent) colours assigned by the [k-nearest neighbour algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) with _k_ = <%=k%>. This classification method uses the survey data to determine the most likely category for a given subdivision based on its _k_ nearest neighbour(s).
This means that setting _k_ to _1_ would find the nearest point to each subdivision's centre and colour the polygons accordingly, while a higher value of _k_ would return a smoother map of colours.
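
A minimal sketch of this classification, using `knn` from the `class` package on made-up coordinates. The points, labels and the two "centroids" below are hypothetical and only show how each centre picks up the majority answer among its _k_ nearest respondents.

```r
library(class)

# Made-up respondents: two near Edinburgh, two near London
train <- data.frame(LNG = c(-3.2, -3.1, -0.1, 0.0),
                    LAT = c(55.9, 56.0, 51.5, 51.6))
labels <- factor(c('pop', 'pop', 'soda', 'soda'))

# Made-up subdivision centres to be classified
centroids <- matrix(c(-3.0, 55.8,
                      -0.2, 51.4), ncol = 2, byrow = TRUE)

# Each centre gets the majority label among its k = 3 nearest respondents
predicted <- knn(train, centroids, labels, k = 3)
predicted
```

With _k_ = 3 each centre still sees two respondents from its own area and only one from the other, so the local answer wins the vote.
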
## Language usage across the UK
Although the characteristics of the four countries addressed in this report may be seen in the above map, some more detailed descriptive statistics are also worth noting.
### Observations
<% if (nrow(df) > 0) { %>
<%=
## find country
ps <- attributes(smallpolies)$polygons
df$country <- factor(rowSums(sapply(1:4, function(i) {
    pss <- ps[[i]]@Polygons
    ifelse(rowSums(sapply(1:length(pss), function(j) {
        coordsPolygon <- pss[j][[1]]@coords
        point.in.polygon(df$LNG, df$LAT, coordsPolygon[, 1], coordsPolygon[, 2])
    })), i, 0)
})))
levels(df$country) <- smallpolies@data$NAME_1
## crosstable
ct <- table(df$A, df$country)
ct
%>
The above table shows the number of geocoded cases for each category in each country, which is not very informative on its own. A column-percentage table with an added marginal column, and with cells emphasized based on the computed standardized Pearson residuals, is easier to interpret.
### Percentages
<%=
ctr <- apply(round(prop.table(addmargins(ct, 2), 2)*100, 2), c(1,2), function(s) paste0(s, '%'))
emphasize.cols(5)
ctres <- suppressWarnings(CrossTable(ct))$CST$stdres
ctre <- which(ctres < -2 | ctres > 2, arr.ind = TRUE)
emphasize.strong.cells(ctre)
set.caption('Cells with standardized residuals above 2 or below -2 are highlighted in bold')
ctr
%>
The last column of the above table shows the overall distribution of the answers about _<%=q%>_, which is worth comparing to the country-specific values. The <%=nrow(ctre)%> most interesting values are highlighted based on their residuals.
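
The percentages come from `prop.table(addmargins(ct, 2), 2)`, which adds a total column and then scales every column to sum to 100%. The sketch below repeats both steps on a made-up 2×2 table (the counts and the `ct_toy` name are assumptions) and flags cells by their standardized residuals, here taken from base R's `chisq.test` rather than from `CrossTable`.

```r
# Made-up counts: rows are answers, columns are countries
ct_toy <- as.table(matrix(c(30, 10, 20, 40), nrow = 2,
                          dimnames = list(c('pop', 'soda'),
                                          c('England', 'Scotland'))))

# Add a 'Sum' column, then convert every column to percentages
pct <- round(prop.table(addmargins(ct_toy, 2), 2) * 100, 2)

# Standardized residuals; |value| > 2 marks a cell far from independence
stdres <- suppressWarnings(chisq.test(ct_toy))$stdres
flagged <- which(abs(stdres) > 2, arr.ind = TRUE)

pct
flagged
```
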
### Statistical tests
<%=
t <- suppressWarnings(chisq.test(ct))
lambda <- lambda.test(ct)
## Cramer's V: the denominator uses min(rows, columns) - 1
cramer <- sqrt(as.numeric(t$statistic) / (sum(ct) * (min(dim(ct)) - 1)))
%>
<%if (t$p.value < 0.05) { %>
There appears to be a real association between the answers and the country ($\chi^2$=<%=as.numeric(t$statistic)%> with <%=as.numeric(t$parameter)%> degrees of freedom, p=<%=t$p.value%>). This means that there is a significant difference in how people answer _<%=q%>_ across the four countries analysed. Based on Cramer\'s V (<%=cramer%>), this association seems to be <%=ifelse(cramer < 0.5, "weak", "strong")%>.
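
As a rough illustration of how Cramér's V relates to the chi-squared statistic, the sketch below applies the textbook definition $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ to a made-up 2×2 table. The `cramers_v` helper and the counts are hypothetical and not part of this report.

```r
# Hypothetical helper: Cramér's V for a two-way contingency table
cramers_v <- function(tab) {
    test <- suppressWarnings(chisq.test(tab, correct = FALSE))
    chi2 <- as.numeric(test$statistic)
    sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Made-up counts: rows are answers, columns are countries
toy <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(c('pop', 'soda'), c('England', 'Scotland')))
v <- cramers_v(toy)
v
```

V ranges from 0 (no association) to 1 (perfect association), which is why it is easier to compare across questions than the raw chi-squared value.
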
<% } else { %>
It seems that no real association can be found between the answers and the country ($\chi^2$=<%=as.numeric(t$statistic)%> with <%=as.numeric(t$parameter)%> degrees of freedom, p=<%=t$p.value%>). This means that there is no significant difference in how people answer _<%=q%>_ across the four countries analysed. Therefore, no further statistical tests were performed.
<% } %>
## Summary
<%=
## turn a proportion into words, e.g. 'one third'
fraction.to.string <- function(x) {
    s <- attr(fractions(x, max.denominator = 10), 'frac')
    s <- strsplit(s, '/')[[1]]
    s <- as.numeric(s)
    if (length(s) == 1 && s == 0)
        return('less than one tenth')
    if (length(s) == 1 && s == 1)
        return('more than nine tenths')
    if (s[2] > 10)
        s <- c(round(x*10, 0), 10)
    s1 <- factor(s[1], levels = 1:9)
    levels(s1) <- c('one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine')
    s2 <- factor(s[2], levels = 2:10)
    levels(s2) <- c('half', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth')
    ## pluralise the denominator when the numerator is above one
    paste(s1, paste0(s2, ifelse(s[1] > 1, 's', '')))
}
%>
The **most popular category** in the United Kingdom for <<_<%=q%>_>> was <<_<%=names(which.max(prop.table(addmargins(ct, 2), 2)[, 5]))%>_>>, chosen by _<%=fraction.to.string(max(prop.table(addmargins(ct, 2), 2)[, 5]))%>_ of the respondents.
<%if (t$p.value < 0.05) { %>
And the most important differences between the countries can be summarised as:
<%=
df$citizen <- df$country
levels(df$citizen) <- c('English', 'Northern Irish', 'Scottish', 'Welsh')
#apply(ctre[!duplicated(ctre[, 1]), ], 1, function(x) {
## build one randomly phrased sentence for each highlighted cell
res <- apply(ctre, 1, function(x) {
    paste(sample(c('it seems that', 'one may say that', 'in short,', 'eventually,'), 1),
          paste(paste0('_', fraction.to.string(prop.table(addmargins(ct, 2), 2)[x[1], x[2]]), '_'), 'of'),
          sample(c(paste('people living in', levels(df$country)[x[2]]),
                   paste(levels(df$citizen)[x[2]], 'people')), 1),
          paste(ifelse(ctres[x[1], x[2]] < 0,
                       sample(c('dislike the answer', 'do not really like the answer', 'tend to dislike the answer', 'disagree with', 'do not agree with'), 1),
                       sample(c('like the answer', 'love the answer', 'tend to like the answer', 'agree with', 'sympathise with'), 1)),
                paste0('<<_', row.names(ctres)[x[1]], '_>>'),
                'that is', ifelse(ctres[x[1], x[2]] < 0, 'low', 'high'),
                sample(c('compared to the average',
                         'compared to the other countries',
                         'compared to the grand average',
                         paste('compared to e.g.', sample(setdiff(levels(df$citizen), levels(df$citizen)[x[2]]), 1), sample(c('people', 'citizens'), 1)),
                         paste('compared to, say,', sample(setdiff(levels(df$citizen), levels(df$citizen)[x[2]]), 1), sample(c('people', 'citizens'), 1)),
                         paste('in comparison with e.g.', sample(setdiff(levels(df$citizen), levels(df$citizen)[x[2]]), 1), sample(c('people', 'citizens'), 1))), 1)
          ))
})
paste(pandoc.list.return(res), collapse = '\n')
%>
<% } else { %>
And people tend to think the same way about _<%=q%>_ across England, Scotland, Wales and Northern Ireland. Why not try to [analyse another question](http://rapporter.net/api/form/b6591e70fa19b53786dc9e1f7e734e5ca26bd4c6e13acffc07fbdc77092d8c55)?
<% } %>
<% } else { %>
No responses were gathered about _<%=q%>_ in the United Kingdom. Why not try to [analyse another question](http://rapporter.net/api/form/b6591e70fa19b53786dc9e1f7e734e5ca26bd4c6e13acffc07fbdc77092d8c55)?
<% } %>