Skip to content

Instantly share code, notes, and snippets.

@leeper
Last active August 29, 2015 14:26
Show Gist options
  • Save leeper/e8bfb6200798d565e484 to your computer and use it in GitHub Desktop.
Save leeper/e8bfb6200798d565e484 to your computer and use it in GitHub Desktop.
A key-value alternative to UNF
# The UNF algorithm provides a clearly defined specification for creating a data hash.
# The algorithm has many strengths, including the fact that it is file format-independent,
# controls for floating point and rounding errors, and is independent of the column-
# organization of a dataset (i.e., variable order is irrelevant).
#
# A weakness, however, is that UNF is row-order sensitive. This is useful in terms of
# data subsetting (i.e., a subset of a dataset produces a different UNF than the full
# dataset), but it produces a weakness when two identical datasets are arranged in
# different row orders. Here's a simple example:
library("UNF")
unf(mtcars[1:2,])
# UNF6:MlIby1x5LHlZxU5UfQNgMQ==
unf(mtcars[2:1,])
# UNF6:m8ra3OOjIH1RLXa7Mm31+A==
# Both of these two-row datasets are identical, but have simply had rows rearranged.
# Because row order is meaningless in modern statistical software (e.g., R), this
# is an unnecessary feature of UNF. It might matter in software such as SPSS, SAS, or
# Stata that have sort-dependent operations, but such an approach is fundamentally
# flawed. Thus, what is needed is a data hashing algorithm that is row-independent.
#
# Another problem with UNF is that it ignores variable names. While this makes sense
# for focusing on the core content of the data, it ignores a potentially vital part
# of the metadata associated with a dataset. If variable names change or the the
# label applied to two variables are swapped, the UNF signatures are identical even
# though the usability of the dataset has fundamentally changed. Consider:
unf(mtcars)
# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==
unf(setNames(mtcars, 1:length(mtcars)))
# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==
# Despite completely different variable names, the UNF signatures are identical.
#
# Luckily, the strengths of UNF can be used to build a better algorithm. Namely:
# 1. An improved algorithm would use the UNF specification for formatting variables.
# 2. The column-independent of UNF, involving sorting a dataset would be preserved.
# 3. The hashing algorithm underlying UNF would be preserved.
#
# The proposed hashing strategy involves a key-value store. Any dataset can be
# converted from a rectangular structure into a simple array of tuples, where
# the first element of each tuple is a key and the second element is a value.
#
# The Key: The key would consist of two parts. The first part would be a record/row
# indicator set by the user during the hashing process. This might be a row name,
# or another unique identifier in the dataset. The second part of the key would be
# a variable name indicator.
#
# The Value: The value would the corresponding value from the dataset for that record
# and variable. The variable values would have been pre-standardized following the
# existing UNF specification, so that all of the values are binary representations.
#
# The Algorithm: With a dataset converted to an array of key-value pairs, the array
# would be sorted according to a simple rule (e.g., alphabetically sort by key).
# The final signature would then a hash of the hash for each pair.
#
# For example, the general approach for the following simple dataset would be:
mtcars[1:2,1:2]
# mpg cyl
# Mazda RX4 21 6
# Mazda RX4 Wag 21 6
keyvalue <- list("Mazda RX4/mpg" = "+21.e+", "Mazda RX4/cyl" = "+6.e+",
"Mazda RX4 Wag/mpg" = "+21.e+", "Mazda RX4 Wag/cyl" = "+6.e+")
(pairs <- unname(mapply(paste, names(keyvalue), keyvalue, sep = ":")))
# [1] "Mazda RX4/mpg:+21.e+" "Mazda RX4/cyl:+6.e+" "Mazda RX4 Wag/mpg:+21.e+" "Mazda RX4 Wag/cyl:+6.e+"
library("digest")
library("base64enc")
base64encode(digest(pairs, algo = "sha256", serialize=FALSE, raw=TRUE))
# [1] "aY5Q9qR4x+ZsuCqG0chVFm8kr4VHHT8llR+CTQZgdb8="
# Some things to consider:
# 1. Truncation and formatting of record indicators prior to constructing keys
# 2. Truncation and formatting of variable labels prior to constructing keys
# 3. Specification of keys for nested structures (e.g., JSON, databases)
# 4. Sort order of keys
#
# Comments welcome.
#
#
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment