Created
July 19, 2021 20:35
-
-
Save raven-rock/74df4453ea5416d1d30e70000631c308 to your computer and use it in GitHub Desktop.
Deterministic Key (DK) generation algorithm for scalars, rows, and row sets (batches)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
require 'set' | |
require 'digest/md5' | |
# Introducing the concept of the *determistic key* (DK)! | |
# The DK is a deterministic, order-independent (when needed), fixed-length hash | |
# digest of a scalar value or a collection of values. Utilizing hashing | |
# algorithms allows for deterministic key generation in distributed systems, and | |
# thus very parallel-friendly asychronous generation by multiple cores or | |
# machines. | |
# The DK provides a convention for an important building block for very | |
# efficient and simple to reason about change data capture (CDC) implemenations. | |
# Here are some use cases for DKs: | |
# 1. Scalars: Generate DKs for scalar values as an alternative primary key than | |
# the natural keys column or surrogate key (sk). In place of natural primary | |
# keys, this is sometimes desirable as it may be significantly shorter and thus | |
# more efficient for key lookups later. In place of SKs, it can save a round | |
# trip to the data store and/or avoid the surrogate key generation bottle | |
# neck. Why not use UUIDs as surrogate keys? A generated UUID will always be | |
# different even for the same object, and sometimes this OK, but there are times | |
# when this may not be what you want, like during parallel processing. | |
# 2. Rows: Generate a DK to represent an entire database row/message at a | |
# time, and don't care about how the row's column keys are ordered. | |
# 3. Row sets/batches: Generate a DK to represent an entire *set of rows*, or | |
# "batch", without having to worry about the order of the rows or the order of | |
# the columns in the rows. | |
# We will use a hash digest algorithm - MD5 by default - to generate dks. It | |
# provides good enough collision-free gaurantees, is short (16 bytes binary | |
# format), and is very fast to compute. MD5 has a widely available API - usually | |
# one simple function call - in most programming languages and SQL dilects, the | |
# thus it is simple to implement DK generation in whatever system it's | |
# needed. Security is a moot feature here, thus SHA1 or SHA256, while technially | |
# fine to use for DKs, are overkill due to larger digest size and more CPU | |
# cycles to generate. | |
# The dk of a *scalar* is the hash of the string representation of the value. | |
class Object | |
DK_CLASS = Digest::MD5 # or, Digest::SHA1, etc. | |
def dk | |
DK_CLASS.hexdigest(to_s) | |
end | |
end | |
# The dk of a collection/sequence ("enumerable" in Ruby) is the dk its concatenated scalar dks: | |
module Enumerable | |
def dk | |
reduce(DK_CLASS.new) { |acc, x| acc << x.dk }.to_s | |
end | |
end | |
# The dk of a *map* ("hash" in Ruby) is the dk of the dk of the vector of its | |
# key dks concatenated with dk of the vector of its value dks ordered in sorted | |
# key order. | |
class Hash | |
def dk | |
sorted_keys = keys.sort | |
(sorted_keys.dk + values_at(*sorted_keys).dk).dk | |
end | |
end | |
# The dk of *set* is the dk of the concatenation of the *sorted* dks of all it's | |
# elements. | |
class Set | |
def dk | |
map(&:dk).sort.join.dk | |
end | |
end |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
## Examples. | |
require_relative './core.rb' | |
# Scalars. | |
"a".dk # => "0cc175b9c0f1b6a831c399e269772661" | |
:a.dk # => "0cc175b9c0f1b6a831c399e269772661" | |
1.dk # => "c4ca4238a0b923820dcc509a6f75849b" | |
"".dk # => "d41d8cd98f00b204e9800998ecf8427e" | |
nil.dk # => "d41d8cd98f00b204e9800998ecf8427e" | |
# Simple collections. | |
[1,"","a"].map(&:dk) # => ["c4ca4238a0b923820dcc509a6f75849b", "d41d8cd98f00b204e9800998ecf8427e", "0cc175b9c0f1b6a831c399e269772661"] | |
[1,"","a"].map(&:dk).join.dk # => "bb24aaf2bc325f5e054225146bc7e18b" | |
[1,"","a"].dk # => "bb24aaf2bc325f5e054225146bc7e18b" | |
[1,nil,:a].dk # => "bb24aaf2bc325f5e054225146bc7e18b" | |
# Simple maps/"rows". Important use case. | |
{a: 41, b: 42, c: nil}.dk # => "5aa7b8a603f5ef1b60071a363eba8534" | |
{b: 42, c: nil, a: 41}.dk # => "5aa7b8a603f5ef1b60071a363eba8534" | |
# Two dimensional collections. | |
[ | |
{a: 41, b: 42, c: nil}, | |
{b: 42, c: nil, a: 41}, | |
].dk # => "77d6df26e7e52c147a7b6e21f80fa96a" | |
[ | |
{b: 42, c: nil, a: 41}, | |
{a: 41, b: 42, c: nil}, | |
].dk # => "77d6df26e7e52c147a7b6e21f80fa96a" | |
# Sets of "rows". Important use case. | |
[ | |
{a: 41, b: 42, c: nil}, | |
{b: 42, c: nil, a: 41}, | |
].to_set.dk # => "7587b9d41d9afad198bf9b618310cc6c" | |
[ | |
{b: 42, c: nil, a: 41}, | |
{a: 41, b: 42, c: nil}, | |
{a: 41, b: 42, c: nil}, | |
{b: 42, c: nil, a: 41}, | |
].to_set.dk # => "7587b9d41d9afad198bf9b618310cc6c" | |
[ | |
{a: 41, b: 42, c: nil}, | |
{a: 51, b: 52, c: nil}, | |
{a: 61, b: 62, c: nil}, | |
{a: 71, b: 72, c: nil}, | |
].to_set.dk # => "0c7901bf1803ad2bb536e7027d7ead49" | |
[ | |
{a: 51, b: 52, c: nil}, | |
{a: 41, b: 42, c: nil}, | |
{a: 71, b: 72, c: nil}, | |
{a: 61, b: 62, c: nil}, | |
].to_set.dk # => "0c7901bf1803ad2bb536e7027d7ead49" | |
# WARNING: Does not work correctly on 'non-value-y' objects. Behavior is undefined; use at own risk. | |
File.new("/etc/passwd") # => #<File:/etc/passwd> | |
File.new("/etc/passwd").to_s # => "#<File:0x00007fd69688b458>" | |
File.new("/etc/passwd").to_s.dk # => "46232691ccaf86868473336636d2746b" | |
File.new("/etc/passwd").dk # => "f577ca10adbe031f8ad73782e3769f33" | |
File.new("/usr/share/dict/words") | |
.map(&:chomp) | |
.take(5) | |
.map { |w| [w.dk, w] } | |
# => [["7fc56270e7a70fa81a5935b72eacbe29", "A"], | |
# ["0cc175b9c0f1b6a831c399e269772661", "a"], | |
# ["4124bc0a9335c27f086f24ba207a4912", "aa"], | |
# ["ff45e881572ca2c987460932660d320c", "aal"], | |
# ["0a1ea2a8d75d02ae052f8222e36927a5", "aalii"]] | |
File.new("/usr/share/dict/words") | |
.map(&:chomp) | |
.take(5) | |
.dk # => "0069fca8b99ca6506acccbe45d778522" | |
File.new("/usr/share/dict/words") | |
.take(5) | |
.map(&:chomp) | |
.map(&:dk) | |
.join | |
.dk # => "0069fca8b99ca6506acccbe45d778522" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment