@rich-iannone
Created July 25, 2020 22:12
A function that reencodes UTF-8 characters to code points
library(magrittr) # provides the `%>%` pipe used below

reencode_utf8 <- function(x) {

  # Ensure that a non-UTF-8 string is converted to UTF-8
  # before its bytes are inspected (`x` is a single string)
  if (Encoding(x) != "UTF-8") {
    x <- enc2utf8(x)
  }

  # Use `iconv()` to convert to UTF-32 (big endian) as
  # raw bytes, then convert each byte to an integer (0-255)
  raw_bytes <-
    iconv(x, "UTF-8", "UTF-32BE", toRaw = TRUE) %>%
    unlist() %>%
    as.integer()

  # Split into a list of four bytes per element (one
  # element per character)
  chars <- split(raw_bytes, ceiling(seq_along(raw_bytes) / 4))

  x <-
    vapply(
      chars,
      FUN.VALUE = character(1),
      USE.NAMES = FALSE,
      FUN = function(bytes) {

        # Combine the four big-endian bytes into a single code
        # point; zero bytes must be kept, otherwise a character
        # such as U+0100 (bytes 00 00 01 00) is read as U+0001
        code_point <- as.integer(sum(bytes * 256^(3:0)))

        if (code_point > 127L) {
          # Non-ASCII: `\u` plus at least four hex digits (code
          # points above U+FFFF produce five or six digits)
          sprintf("\\u%04x", code_point)
        } else {
          # ASCII characters pass through unchanged
          intToUtf8(code_point)
        }
      }
    ) %>%
    paste(collapse = "")

  x
}
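
One way to cross-check the function above: base R's `utf8ToInt()` returns code points directly, so the same escaping can be sketched without going through UTF-32 bytes. `escape_code_points()` is a hypothetical name and not part of the gist. The sketch assumes a single non-NA string and writes each non-ASCII character as `\u` followed by at least four hex digits.

```r
# Hypothetical helper (not part of the gist): escape non-ASCII
# characters using the code points returned by utf8ToInt()
escape_code_points <- function(x) {

  stopifnot(is.character(x), length(x) == 1L, !is.na(x))

  # One integer code point per character in the string
  code_points <- utf8ToInt(enc2utf8(x))

  # Keep ASCII characters as they are; escape everything else
  pieces <-
    ifelse(
      code_points < 128L,
      intToUtf8(code_points, multiple = TRUE),
      sprintf("\\u%04x", code_points)
    )

  paste(pieces, collapse = "")
}

cat(escape_code_points("caf\u00e9"), "\n") # prints: caf\u00e9
```

Comparing the two outputs on strings that contain combining marks (for example `"e\u0301"`) or characters with a zero low byte (for example `"\u0100"`) is a quick way to spot byte-handling mistakes.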