leeper/KVF.R

## KVF.R
# The UNF algorithm provides a clearly defined specification for creating a data hash.
# The algorithm has many strengths, including the fact that it is file format-independent,
# controls for floating point and rounding errors, and is independent of the column-
# organization of a dataset (i.e., variable order is irrelevant).
#
# A weakness, however, is that UNF is row-order sensitive. This is useful in terms of
# data subsetting (i.e., a subset of a dataset produces a different UNF than the full
# dataset), but it produces a weakness when two identical datasets are arranged in
# different row orders. Here's a simple example:

library("UNF")
unf(mtcars[1:2,])
# UNF6:MlIby1x5LHlZxU5UfQNgMQ==
unf(mtcars[2:1,])
# UNF6:m8ra3OOjIH1RLXa7Mm31+A==

# Both of these two-row datasets are identical, but have simply had rows rearranged.
# Because row order is meaningless in modern statistical software (e.g., R), this
# is an unnecessary feature of UNF. It might matter in software such as SPSS, SAS, or
# Stata that have sort-dependent operations, but such an approach is fundamentally
# flawed. Thus, what is needed is a data hashing algorithm that is row-independent.
#
# Another problem with UNF is that it ignores variable names. While this makes sense
# for focusing on the core content of the data, it ignores a potentially vital part
# of the metadata associated with a dataset. If variable names change or the the
# label applied to two variables are swapped, the UNF signatures are identical even
# though the usability of the dataset has fundamentally changed. Consider:

unf(mtcars)
# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==
unf(setNames(mtcars, 1:length(mtcars)))
# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==

# Despite completely different variable names, the UNF signatures are identical.
#
# Luckily, the strengths of UNF can be used to build a better algorithm. Namely:
# 1. An improved algorithm would use the UNF specification for formatting variables.
# 2. The column-independent of UNF, involving sorting a dataset would be preserved.
# 3. The hashing algorithm underlying UNF would be preserved.
#
# The proposed hashing strategy involves a key-value store. Any dataset can be
# converted from a rectangular structure into a simple array of tuples, where
# the first element of each tuple is a key and the second element is a value.
#
# The Key: The key would consist of two parts. The first part would be a record/row
# indicator set by the user during the hashing process. This might be a row name,
# or another unique identifier in the dataset. The second part of the key would be
# a variable name indicator.
#
# The Value: The value would the corresponding value from the dataset for that record
# and variable. The variable values would have been pre-standardized following the
# existing UNF specification, so that all of the values are binary representations.
#
# The Algorithm: With a dataset converted to an array of key-value pairs, the array
# would be sorted according to a simple rule (e.g., alphabetically sort by key).
# The final signature would then a hash of the hash for each pair.
#
# For example, the general approach for the following simple dataset would be:

mtcars[1:2,1:2]
#               mpg cyl
# Mazda RX4      21   6
# Mazda RX4 Wag  21   6
keyvalue <- list("Mazda RX4/mpg" = "+21.e+", "Mazda RX4/cyl" = "+6.e+",
                 "Mazda RX4 Wag/mpg" = "+21.e+", "Mazda RX4 Wag/cyl" = "+6.e+")
(pairs <- unname(mapply(paste, names(keyvalue), keyvalue, sep = ":")))
# [1] "Mazda RX4/mpg:+21.e+"     "Mazda RX4/cyl:+6.e+"      "Mazda RX4 Wag/mpg:+21.e+" "Mazda RX4 Wag/cyl:+6.e+"
library("digest")
library("base64enc")
base64encode(digest(pairs, algo = "sha256", serialize=FALSE, raw=TRUE))
# [1] "aY5Q9qR4x+ZsuCqG0chVFm8kr4VHHT8llR+CTQZgdb8="

# Some things to consider:
# 1. Truncation and formatting of record indicators prior to constructing keys
# 2. Truncation and formatting of variable labels prior to constructing keys
# 3. Specification of keys for nested structures (e.g., JSON, databases)
# 4. Sort order of keys
#
# Comments welcome.
#
#
	# The UNF algorithm provides a clearly defined specification for creating a data hash.
	# The algorithm has many strengths, including the fact that it is file format-independent,
	# controls for floating point and rounding errors, and is independent of the column-
	# organization of a dataset (i.e., variable order is irrelevant).
	#
	# A weakness, however, is that UNF is row-order sensitive. This is useful in terms of
	# data subsetting (i.e., a subset of a dataset produces a different UNF than the full
	# dataset), but it produces a weakness when two identical datasets are arranged in
	# different row orders. Here's a simple example:

	library("UNF")
	unf(mtcars[1:2,])
	# UNF6:MlIby1x5LHlZxU5UfQNgMQ==
	unf(mtcars[2:1,])
	# UNF6:m8ra3OOjIH1RLXa7Mm31+A==

	# Both of these two-row datasets are identical, but have simply had rows rearranged.
	# Because row order is meaningless in modern statistical software (e.g., R), this
	# is an unnecessary feature of UNF. It might matter in software such as SPSS, SAS, or
	# Stata that have sort-dependent operations, but such an approach is fundamentally
	# flawed. Thus, what is needed is a data hashing algorithm that is row-independent.
	#
	# Another problem with UNF is that it ignores variable names. While this makes sense
	# for focusing on the core content of the data, it ignores a potentially vital part
	# of the metadata associated with a dataset. If variable names change or the the
	# label applied to two variables are swapped, the UNF signatures are identical even
	# though the usability of the dataset has fundamentally changed. Consider:

	unf(mtcars)
	# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==
	unf(setNames(mtcars, 1:length(mtcars)))
	# UNF6:lJ2kCuaI9qFfW9XPRhy/aA==

	# Despite completely different variable names, the UNF signatures are identical.
	#
	# Luckily, the strengths of UNF can be used to build a better algorithm. Namely:
	# 1. An improved algorithm would use the UNF specification for formatting variables.
	# 2. The column-independent of UNF, involving sorting a dataset would be preserved.
	# 3. The hashing algorithm underlying UNF would be preserved.
	#
	# The proposed hashing strategy involves a key-value store. Any dataset can be
	# converted from a rectangular structure into a simple array of tuples, where
	# the first element of each tuple is a key and the second element is a value.
	#
	# The Key: The key would consist of two parts. The first part would be a record/row
	# indicator set by the user during the hashing process. This might be a row name,
	# or another unique identifier in the dataset. The second part of the key would be
	# a variable name indicator.
	#
	# The Value: The value would the corresponding value from the dataset for that record
	# and variable. The variable values would have been pre-standardized following the
	# existing UNF specification, so that all of the values are binary representations.
	#
	# The Algorithm: With a dataset converted to an array of key-value pairs, the array
	# would be sorted according to a simple rule (e.g., alphabetically sort by key).
	# The final signature would then a hash of the hash for each pair.
	#
	# For example, the general approach for the following simple dataset would be:

	mtcars[1:2,1:2]
	# mpg cyl
	# Mazda RX4 21 6
	# Mazda RX4 Wag 21 6
	keyvalue <- list("Mazda RX4/mpg" = "+21.e+", "Mazda RX4/cyl" = "+6.e+",
	"Mazda RX4 Wag/mpg" = "+21.e+", "Mazda RX4 Wag/cyl" = "+6.e+")
	(pairs <- unname(mapply(paste, names(keyvalue), keyvalue, sep = ":")))
	# [1] "Mazda RX4/mpg:+21.e+" "Mazda RX4/cyl:+6.e+" "Mazda RX4 Wag/mpg:+21.e+" "Mazda RX4 Wag/cyl:+6.e+"
	library("digest")
	library("base64enc")
	base64encode(digest(pairs, algo = "sha256", serialize=FALSE, raw=TRUE))
	# [1] "aY5Q9qR4x+ZsuCqG0chVFm8kr4VHHT8llR+CTQZgdb8="

	# Some things to consider:
	# 1. Truncation and formatting of record indicators prior to constructing keys
	# 2. Truncation and formatting of variable labels prior to constructing keys
	# 3. Specification of keys for nested structures (e.g., JSON, databases)
	# 4. Sort order of keys
	#
	# Comments welcome.
	#
	#