Skip to content

Instantly share code, notes, and snippets.

@raven-rock
Created July 19, 2021 20:35
Show Gist options
  • Save raven-rock/74df4453ea5416d1d30e70000631c308 to your computer and use it in GitHub Desktop.
Save raven-rock/74df4453ea5416d1d30e70000631c308 to your computer and use it in GitHub Desktop.
Deterministic Key (DK) generation algorithm for scalars, rows, and row sets (batches)
require 'set'
require 'digest/md5'
# Introducing the concept of the *determistic key* (DK)!
# The DK is a deterministic, order-independent (when needed), fixed-length hash
# digest of a scalar value or a collection of values. Utilizing hashing
# algorithms allows for deterministic key generation in distributed systems, and
# thus very parallel-friendly asychronous generation by multiple cores or
# machines.
# The DK provides a convention for an important building block for very
# efficient and simple to reason about change data capture (CDC) implemenations.
# Here are some use cases for DKs:
# 1. Scalars: Generate DKs for scalar values as an alternative primary key than
# the natural keys column or surrogate key (sk). In place of natural primary
# keys, this is sometimes desirable as it may be significantly shorter and thus
# more efficient for key lookups later. In place of SKs, it can save a round
# trip to the data store and/or avoid the surrogate key generation bottle
# neck. Why not use UUIDs as surrogate keys? A generated UUID will always be
# different even for the same object, and sometimes this OK, but there are times
# when this may not be what you want, like during parallel processing.
# 2. Rows: Generate a DK to represent an entire database row/message at a
# time, and don't care about how the row's column keys are ordered.
# 3. Row sets/batches: Generate a DK to represent an entire *set of rows*, or
# "batch", without having to worry about the order of the rows or the order of
# the columns in the rows.
# We will use a hash digest algorithm - MD5 by default - to generate dks. It
# provides good enough collision-free gaurantees, is short (16 bytes binary
# format), and is very fast to compute. MD5 has a widely available API - usually
# one simple function call - in most programming languages and SQL dilects, the
# thus it is simple to implement DK generation in whatever system it's
# needed. Security is a moot feature here, thus SHA1 or SHA256, while technially
# fine to use for DKs, are overkill due to larger digest size and more CPU
# cycles to generate.
# The dk of a *scalar* is the hash of the string representation of the value.
class Object
DK_CLASS = Digest::MD5 # or, Digest::SHA1, etc.
def dk
DK_CLASS.hexdigest(to_s)
end
end
# The dk of a collection/sequence ("enumerable" in Ruby) is the dk its concatenated scalar dks:
module Enumerable
def dk
reduce(DK_CLASS.new) { |acc, x| acc << x.dk }.to_s
end
end
# The dk of a *map* ("hash" in Ruby) is the dk of the dk of the vector of its
# key dks concatenated with dk of the vector of its value dks ordered in sorted
# key order.
class Hash
def dk
sorted_keys = keys.sort
(sorted_keys.dk + values_at(*sorted_keys).dk).dk
end
end
# The dk of *set* is the dk of the concatenation of the *sorted* dks of all it's
# elements.
class Set
def dk
map(&:dk).sort.join.dk
end
end
## Examples.
require_relative './core.rb'
# Scalars.
"a".dk # => "0cc175b9c0f1b6a831c399e269772661"
:a.dk # => "0cc175b9c0f1b6a831c399e269772661"
1.dk # => "c4ca4238a0b923820dcc509a6f75849b"
"".dk # => "d41d8cd98f00b204e9800998ecf8427e"
nil.dk # => "d41d8cd98f00b204e9800998ecf8427e"
# Simple collections.
[1,"","a"].map(&:dk) # => ["c4ca4238a0b923820dcc509a6f75849b", "d41d8cd98f00b204e9800998ecf8427e", "0cc175b9c0f1b6a831c399e269772661"]
[1,"","a"].map(&:dk).join.dk # => "bb24aaf2bc325f5e054225146bc7e18b"
[1,"","a"].dk # => "bb24aaf2bc325f5e054225146bc7e18b"
[1,nil,:a].dk # => "bb24aaf2bc325f5e054225146bc7e18b"
# Simple maps/"rows". Important use case.
{a: 41, b: 42, c: nil}.dk # => "5aa7b8a603f5ef1b60071a363eba8534"
{b: 42, c: nil, a: 41}.dk # => "5aa7b8a603f5ef1b60071a363eba8534"
# Two dimensional collections.
[
{a: 41, b: 42, c: nil},
{b: 42, c: nil, a: 41},
].dk # => "77d6df26e7e52c147a7b6e21f80fa96a"
[
{b: 42, c: nil, a: 41},
{a: 41, b: 42, c: nil},
].dk # => "77d6df26e7e52c147a7b6e21f80fa96a"
# Sets of "rows". Important use case.
[
{a: 41, b: 42, c: nil},
{b: 42, c: nil, a: 41},
].to_set.dk # => "7587b9d41d9afad198bf9b618310cc6c"
[
{b: 42, c: nil, a: 41},
{a: 41, b: 42, c: nil},
{a: 41, b: 42, c: nil},
{b: 42, c: nil, a: 41},
].to_set.dk # => "7587b9d41d9afad198bf9b618310cc6c"
[
{a: 41, b: 42, c: nil},
{a: 51, b: 52, c: nil},
{a: 61, b: 62, c: nil},
{a: 71, b: 72, c: nil},
].to_set.dk # => "0c7901bf1803ad2bb536e7027d7ead49"
[
{a: 51, b: 52, c: nil},
{a: 41, b: 42, c: nil},
{a: 71, b: 72, c: nil},
{a: 61, b: 62, c: nil},
].to_set.dk # => "0c7901bf1803ad2bb536e7027d7ead49"
# WARNING: Does not work correctly on 'non-value-y' objects. Behavior is undefined; use at own risk.
File.new("/etc/passwd") # => #<File:/etc/passwd>
File.new("/etc/passwd").to_s # => "#<File:0x00007fd69688b458>"
File.new("/etc/passwd").to_s.dk # => "46232691ccaf86868473336636d2746b"
File.new("/etc/passwd").dk # => "f577ca10adbe031f8ad73782e3769f33"
File.new("/usr/share/dict/words")
.map(&:chomp)
.take(5)
.map { |w| [w.dk, w] }
# => [["7fc56270e7a70fa81a5935b72eacbe29", "A"],
# ["0cc175b9c0f1b6a831c399e269772661", "a"],
# ["4124bc0a9335c27f086f24ba207a4912", "aa"],
# ["ff45e881572ca2c987460932660d320c", "aal"],
# ["0a1ea2a8d75d02ae052f8222e36927a5", "aalii"]]
File.new("/usr/share/dict/words")
.map(&:chomp)
.take(5)
.dk # => "0069fca8b99ca6506acccbe45d778522"
File.new("/usr/share/dict/words")
.take(5)
.map(&:chomp)
.map(&:dk)
.join
.dk # => "0069fca8b99ca6506acccbe45d778522"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment