Thoughts on a Distributed Underlay
One challenge when navigating a distributed network is how to get a sense of where you are and where you'd like to go. This often means having a sort of map, a fuzzy understanding of the topology of information, without having all the information physically on your system.
This can be close to impossible if information is scattered uniformly at random (distributed hash tables behave this way in spirit). Even when it is not, it remains tricky to describe the information useful for navigation efficiently and succinctly.
Each computer can maintain a bloom filter of the facts that it knows and others can see this bloom filter. The nice thing about bloom filters is that they are generally additive, so nodes can broadcast both what they know and what their peers know.
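As a sketch of that additivity (the parameters and the choice of SHA-256 as the shared hash here are illustrative, not a proposal):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter; m and k are illustrative, not tuned."""

    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, item):
        # Derive k bit indices from a standardized hash (SHA-256 by assumption).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indices(item))

    def union(self, other):
        # The additive property: OR-ing two filters yields the filter of the
        # union of their sets (same m, k, and hash function assumed).
        merged = BloomFilter(self.m, self.k)
        merged.bits = [a | b for a, b in zip(self.bits, other.bits)]
        return merged
```

Because the union is a plain bitwise OR, a node can merge its peers' broadcast filters into its own broadcast without ever seeing the underlying facts.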
Maybe we need some system that ensures the topology of the network consists entirely of trustworthy nodes, i.e. nodes that are not trying to launch Sybil attacks. (Nodes sharing low-quality information are a separate concern and not relevant here.)
Everyone maintains and broadcasts (at least) these two bloom filters:
- all the information they have
- all the information their peers know about
The second one is interesting because it represents the OR (union) of the first and second filters of each of its peers. In theory that means that if anyone anywhere in the network has a piece of information, everyone should quickly know about it.
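A small simulation of that propagation, with filters as raw bitvectors and a hypothetical `Node` holding the two broadcast fields (`own` and `reach` are names invented here):

```python
def or_bits(filters):
    # Union of bloom bitvectors (all assumed to share m, k, and hash).
    return [int(any(col)) for col in zip(*filters)]

class Node:
    def __init__(self, m=16):
        self.own = [0] * m    # filter 1: facts stored locally
        self.reach = [0] * m  # filter 2: facts this node's peers know about
        self.peers = []

def gossip_round(nodes):
    # One round: each node's 'reach' filter becomes the union of each peer's
    # last broadcast pair (own | reach), so knowledge spreads one hop per round.
    snapshot = {id(n): or_bits([n.own, n.reach]) for n in nodes}
    for n in nodes:
        incoming = [snapshot[id(p)] for p in n.peers]
        n.reach = or_bits(incoming) if incoming else [0] * len(n.own)
```

In a chain a–b–c, a bit set on a reaches b's `reach` filter after one round and c's after two, which is the "quickly known by everyone" behavior, one hop per gossip round.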
Maybe we can make a kind of probabilistic bloom filter? Rather than a bit, each position holds a number. This number grows as that bit is more highly attested (or shrinks with the number of hops needed to reach it). For instance, each node could take the arithmetic mean of its own bitvector and its peers' maps; that way the strength of a bit decays exponentially with distance.
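One way to realize the averaging (the equal weighting of a node's own bits and its peers' maps is an assumption here, not something the scheme pins down):

```python
def mix_round(nodes, adj):
    # One averaging round: each node's map becomes the arithmetic mean of its
    # own bitvector and its neighbors' maps from the previous round, so a
    # fact's "signal" is diluted at every hop.
    snapshot = {name: list(n["map"]) for name, n in nodes.items()}
    for name, n in nodes.items():
        rows = [n["own"]] + [snapshot[p] for p in adj[name]]
        n["map"] = [sum(col) / len(rows) for col in zip(*rows)]
```

Running a few rounds over a chain a–b–c with a fact stored only at a leaves a's map value for that bit highest, b's lower, and c's lower still, which is exactly the distance signal the lookup below relies on.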
From this we can simply look up a value in the bloom filter and get a sense of how far away that piece of information is, and we can compare the maps of all our peers to see which of them is closest to that information.
The fundamental operation in the composition of a query is the computation of an intersection between the results of two larger queries. Thus a distributed query resolution engine fundamentally entails a system for computing an intersection between two computers— ideally without sending all the information across.
Likely the most succinct way to do this is for both nodes to compute bloom filters of the relevant match key and exchange their filters. The filters can get AND'd together, to produce a bloom filter for candidate matches. Each node then filters their list according to the combined bloom filter and exchanges matching results— which should give a superset of the information necessary to compute the intersection (bloom filters have false positives, but no false negatives).
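A sketch of that exchange, simulated in one process (filter parameters and helper names are illustrative):

```python
import hashlib

def bloom_indices(key, m, k):
    digest = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]

def bloom_of(keys, m, k):
    bits = [0] * m
    for key in keys:
        for i in bloom_indices(key, m, k):
            bits[i] = 1
    return bits

def distributed_intersection(side_a, side_b, m=4096, k=3):
    # 1. Each side summarizes its match keys as a bloom filter; only the
    #    filters cross the wire.
    fa, fb = bloom_of(side_a, m, k), bloom_of(side_b, m, k)
    # 2. AND the filters: a key can be in the intersection only if every one
    #    of its k bits is set in both.
    combined = [x & y for x, y in zip(fa, fb)]
    # 3. Each side keeps only candidates passing the combined filter.
    #    False positives possible, false negatives not.
    cand_a = {key for key in side_a
              if all(combined[i] for i in bloom_indices(key, m, k))}
    cand_b = {key for key in side_b
              if all(combined[i] for i in bloom_indices(key, m, k))}
    # 4. Exchange the (much smaller) candidate sets and verify exactly.
    return cand_a & cand_b
```

The final exact check strips out the false positives, so the result is the true intersection even though most of each side's data never left its machine.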
This bloom filter approach to filtering likely minimizes the network overhead: a filter of roughly a megabyte can summarize millions of entities.
To make this more computationally efficient to query, we can maintain bloom filters of all known lists as a sort of index/summary statistic structure. Whenever a piece of data is added, we simply add it to the bloom filter. This requires us to standardize the hash function and the bloom filter parameters across the network (m=3e7, k=23 ought to be enough for anyone).
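Whether those standardized parameters are "enough" can be checked with the standard false-positive approximation for a bloom filter holding n items, p ≈ (1 − e^(−kn/m))^k:

```python
import math

def false_positive_rate(n, m=int(3e7), k=23):
    # Standard approximation for a bloom filter with m bits, k hash
    # functions, and n inserted items.
    return (1 - math.exp(-k * n / m)) ** k
```

With m = 3e7 and k = 23, a node holding a million entries still sees a false-positive rate well below one in a hundred thousand, so the joke parameters are not unreasonable at that scale.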
However, naively this still requires a linear scan through each respective database to find bloom filter matches— which can potentially pose a problem.
This can likely be addressed with some sort of locality-sensitive hashing: bucket data according to which bloom filter bits it sets. The storage can be sharded into a number of buckets, and an item is added to a bucket if any of that bucket's designated subset of bits are set for it. For example, one bucket could cover bit indices 0 through 9, the next indices 10 through 19, and so on.
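A minimal sketch of that bucketing, with a deliberately tiny m so the bucket count stays readable (all names and parameters here are hypothetical):

```python
import hashlib

def bloom_indices(key, m=64, k=3):
    digest = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]

def bucket_ids(key, m=64, k=3, width=10):
    # A key belongs to every bucket covering one of its bit indices, so a key
    # lands in at most k buckets; a little duplication is traded for the
    # ability to skip whole buckets at query time.
    return {i // width for i in bloom_indices(key, m, k)}

def shard(keys, m=64, k=3, width=10):
    buckets = {}
    for key in keys:
        for b in bucket_ids(key, m, k, width):
            buckets.setdefault(b, set()).add(key)
    return buckets
```

At query time, only buckets whose bit range intersects the set bits of the combined candidate filter need to be scanned; the rest can be skipped outright.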
For situations where the intersection is very small compared to the cardinality of the overall data, maybe only one bucket needs to be visited to resolve the query.