nagydani/topic-discovery.txt

## topic-discovery.txt
1. Introduction

Each node in the Ethereum network has a set of tags called "topics", by
which other nodes may choose to associate with it. Topics may indicate
capability, responsibility for certain information or functionality, or
any other attribute by which the node may wish to belong to a connected
subnetwork of the Ethereum network. This document outlines an
infrastructure by which nodes of a certain topic may discover one another.

The presented infrastructure can be used for two distinct purposes:
nodes covered by the same topic can use it to form and maintain a
connected network, while nodes not covered by a certain topic can use
it to find nodes covered by that topic in order to use a service provided
by such nodes.

Topics are assumed to potentially cover arbitrarily large or small parts
of the Ethereum network, meaning that certain topics can contain only
one node while other topics may span the entire network. Irrespective of
the number of nodes covered by a topic, the discovery infrastructure should
provide a suitable set of bootstrap nodes for new nodes wishing to join
the network or those wishing to use the services of nodes covered by said
topic.

Furthermore, it is assumed that each node is covered by at most a few dozen
topics.


2. Topic structure

On the lowest level, topics are assumed to be unstructured, denoted by a
string of printable characters. On a higher level, two possible
hierarchical topic structures are defined: one in which nodes covered by
a topic and all its subtopics are searched and another in which nodes covered
by a topic and all its supertopics are searched. Subtopics are denoted by
a string that has the parent topic as its prefix, followed by some delimiting
character.

The two higher-level hierarchies are handled as follows. In the first
case, when nodes covered by subtopics need to be discoverable, nodes
must be tagged by their topic and all their supertopics (e.g. a node
covered by "animal.mammal.dog" is also covered by "animal.mammal" and
"animal", so that a search for "animal.mammal" will find it). In the
second case, when nodes covered by supertopics need to be discoverable,
the node doing the discovery also searches for all the supertopics.
Apart from this, the directory is agnostic of topic structure.
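
The supertopic expansion used in both cases can be sketched as follows.
This is a minimal illustration, assuming "." as the delimiting character
(as in the example above); the function name is hypothetical.

```python
def with_supertopics(topic, delimiter="."):
    """Return the topic followed by all of its supertopics, most specific first.

    Assumption: subtopics extend their parent with `delimiter` plus a label.
    """
    parts = topic.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(len(parts), 0, -1)]


# Case 1: a registering node tags itself with every entry of this list.
# Case 2: a searching node queries the directory for every entry of this list.
tags = with_supertopics("animal.mammal.dog")
```

In the first case the cost of the hierarchy is paid by the registering
node, in the second case by the searching node; the directory itself only
ever sees flat topic strings.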

The DHT distance measure between a topic and a node's DHT address is
defined as the XOR of the latter with the hash of the topic string.
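
As a rough sketch of this distance measure: the document does not name the
hash function or the address width, so SHA-256 and 256-bit integer
addresses are used below purely as stand-in assumptions, and all function
names are hypothetical.

```python
import hashlib


def topic_hash(topic):
    """Hash a topic string to a 256-bit integer (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(topic.encode("utf-8")).digest(), "big")


def topic_distance(topic, node_address):
    """XOR distance between a topic and a node's DHT address."""
    return topic_hash(topic) ^ node_address


def closest_nodes(topic, node_addresses, count=3):
    """Return up to `count` addresses ordered by increasing distance to the topic."""
    return sorted(node_addresses, key=lambda a: topic_distance(topic, a))[:count]
```

A node whose address equals the topic hash is at distance zero, and
addresses that differ from the hash only in low-order bits are closer than
those that differ in high-order bits.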


3. Architecture

Two states of discovery are clearly distinguished: before and after the
searching node has made its first successful handshake with a node covered
by the searched topic.

In the first state, the node searches the topic directory, which is
organized as a DHT described later in more detail. From the topic directory, it will
obtain the contact details (network address, port and public key) of a
relatively small set of matching nodes to which it attempts to connect.
Upon the first successful handshake, it transitions to the second state.

In the second state, nodes covered by a certain topic introduce each
other to the searching node. The strategy followed in this state may
well depend on the requirements of the topic and the desired
(sub-)network topology. The discussion of this second-state search therefore
falls outside the scope of this document. Nodes can transition back
to the first state if for any reason they disconnect from all other nodes
covered by their topic.
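
The two states and the transitions between them can be sketched as a small
state machine. The class, state and method names below are hypothetical;
only the transition rules are taken from the text above.

```python
from enum import Enum


class DiscoveryState(Enum):
    DIRECTORY_SEARCH = 1   # first state: query the topic directory (DHT)
    PEER_INTRODUCTION = 2  # second state: topic peers introduce each other


class TopicDiscovery:
    """Tracks the discovery state of one node for one topic."""

    def __init__(self):
        self.state = DiscoveryState.DIRECTORY_SEARCH
        self.topic_peers = set()

    def on_handshake(self, peer):
        # The first successful handshake moves the node to the second state.
        self.topic_peers.add(peer)
        self.state = DiscoveryState.PEER_INTRODUCTION

    def on_disconnect(self, peer):
        # Losing the last topic peer sends the node back to the first state.
        self.topic_peers.discard(peer)
        if not self.topic_peers:
            self.state = DiscoveryState.DIRECTORY_SEARCH
```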

Each node participating in the DHT stores records for a (potentially)
large number of topics and at most a small number of records for each
topic. Each record binds one node to one topic. A record contains, in
addition to the topic, the network address, the port and the public key
of the node as well as a timestamp and a digital signature by the
referred node calculated on the rest of the record.
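
One possible shape for such a record is sketched below. The field names,
the JSON serialization of the signed payload and the pluggable `sign`
callable are all assumptions made for illustration; the document does not
prescribe an encoding or a signature scheme.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TopicRecord:
    topic: str
    address: str        # network address of the registered node
    port: int
    public_key: bytes
    timestamp: int
    signature: bytes = b""

    def signing_payload(self):
        """Serialize every field except the signature (encoding is an assumption)."""
        fields = [self.topic, self.address, self.port,
                  self.public_key.hex(), self.timestamp]
        return json.dumps(fields).encode("utf-8")


def with_signature(record, sign):
    """Return a copy of `record` signed by `sign`, a callable bytes -> bytes.

    `sign` stands in for the node's real signing routine, which is not
    specified here.
    """
    return replace(record, signature=sign(record.signing_payload()))
```

Because the signature covers the rest of the record, the payload must be
computed without the signature field; signing a record therefore does not
change its payload.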


4. Registration

Nodes regularly register themselves for each topic covering them by pushing
the corresponding records into the DHT. Only registrations with a very recent
timestamp and a correct signature are accepted. If the DHT is direct (no
content forwarding), the originator network address is also checked, as nodes
are only allowed to register themselves.
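
The acceptance checks can be sketched as follows. The 60-second window,
the rejection of timestamps too far in the future and all names are
assumptions; `verify` stands in for the real signature check.

```python
from typing import NamedTuple


class Registration(NamedTuple):
    topic: str
    address: str      # network address claimed inside the record
    timestamp: float  # seconds since the epoch


def accept_registration(reg, sender_address, verify, now,
                        max_age=60.0, direct=True):
    """Return True if the registration passes all checks.

    verify:  callable Registration -> bool (signature check, assumed).
    max_age: hypothetical bound on what counts as "very recent".
    direct:  True for a DHT without content forwarding.
    """
    if abs(now - reg.timestamp) > max_age:
        return False  # stale, or too far in the future (assumption)
    if not verify(reg):
        return False  # bad signature
    if direct and reg.address != sender_address:
        return False  # nodes may only register themselves
    return True
```

In a forwarding DHT the sender of the message is not necessarily the
registering node, which is why the address check applies only in the
direct case.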

In order to prevent overload on nodes with addresses close to popular
topics, the time interval between these registrations is controlled as
follows: upon registration, the timestamp of the most recent registration
is returned to the registering node. The next registration is scheduled
after a delay inversely proportional to the age of the most recent record,
so the more recently a topic has been registered, the longer a node waits
before registering again.
If the capacity for records for one topic is full, a random record for
the topic in question is deleted. As a security measure, nodes with overly
frequent registrations may be blacklisted.
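
A minimal sketch of this scheduling and eviction logic follows. The
proportionality constant `k`, the `min_age` guard against division by zero
and the function names are assumptions; records are modeled as
(timestamp, node) tuples.

```python
import random


def register(records, node_id, timestamp, capacity, rng=random):
    """Store a record for one topic and return the timestamp of the most
    recent record held before this one (None if there was none)."""
    most_recent = max((ts for ts, _ in records), default=None)
    if len(records) >= capacity:
        # Capacity reached: delete a random record for this topic.
        records.pop(rng.randrange(len(records)))
    records.append((timestamp, node_id))
    return most_recent


def next_registration_delay(most_recent, now, k=6000.0, min_age=1.0):
    """Delay inversely proportional to the age of the most recent record.

    k (in seconds squared) and min_age are hypothetical tuning constants.
    """
    age = max(now - most_recent, min_age)
    return k / age
```

With these constants, a topic whose most recent record is 100 seconds old
yields a delay of 60 seconds, while a record only 10 seconds old yields 600
seconds. What a node should do when no earlier record exists is not
specified in the text and is left to the caller here.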


5. Retrieval

First, a DHT node sufficiently close to the topic is found. Upon request
(i.e. search by topic), all sufficiently recent records are returned.

If the DHT is implemented without query and data forwarding, this might
result in excessive hammering of nodes close to popular topics. For such
implementations, it is advisable for registering nodes to register
themselves at every step of the DHT lookup, not only at the nodes closest
to the topic. This way, retrieval lookups are likely to find registrations
closer to themselves and further away from the closest node in the DHT.
Another possible defense against hammering by retrieval requests is to
use them sparingly: both topic member nodes and client nodes must strive
to remain connected to the topic network by querying their neighbors for
further connections, maintaining a healthy redundancy of
connectivity without resorting to discovery.
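
Registering at every step of the lookup can be sketched as below. The
lookup is a deliberately simplified greedy, single-path walk (a real
Kademlia lookup queries several nodes in parallel), and the `neighbors`
and `stores` dictionaries are hypothetical stand-ins for the network.

```python
def lookup_path(start, target, neighbors):
    """Greedy walk from `start` towards `target` by XOR distance.

    neighbors: dict mapping a node address to the addresses it knows.
    Returns the list of nodes visited, beginning with `start`.
    """
    path = [start]
    current = start
    while True:
        best = min(neighbors.get(current, []),
                   key=lambda n: n ^ target, default=current)
        if best ^ target >= current ^ target:
            return path  # no known node is closer to the target
        path.append(best)
        current = best


def register_along_path(start, target, neighbors, stores, record):
    """Leave `record` at every node contacted during the lookup."""
    for node in lookup_path(start, target, neighbors)[1:]:
        stores.setdefault(node, []).append(record)
```

A retrieval lookup that starts elsewhere but passes through any of these
nodes can stop there, without reaching the node closest to the topic.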

Forwarding DHT implementations, however, avoid the issue altogether by
propagating and caching records along the path between the node on which
they were found and the querying node.
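
The caching behavior of a forwarding DHT can be illustrated with the
following sketch. The path is given explicitly and `stores` is a
hypothetical in-memory map from node to per-topic records; a real
implementation would forward the query hop by hop over the network.

```python
def forward_query(path, topic, stores):
    """Forward a query along `path` (querying node first, towards the topic).

    Returns the records found at the first node holding any, and caches a
    copy on every intermediate node between the querying node and that node.
    """
    for i, node in enumerate(path):
        records = stores.get(node, {}).get(topic)
        if records:
            for intermediate in path[1:i]:
                stores.setdefault(intermediate, {})[topic] = list(records)
            return list(records)
    return []
```

After one query has travelled the full path, later queries that share an
intermediate node are answered from that node's cache and never reach the
node closest to the topic.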


6. Proposed implementation strategies

a) Leave the UDP-based Kademlia discovery intact and oblivious of topics,
its only purpose being to maintain a connected network of assorted
Ethereum nodes. The mechanism proposed above would be implemented as a
p2p subprotocol (named, e.g., DIR) with a forwarding Kademlia similar to
that of Swarm, based largely on the Swarm codebase. Instead of storing one
Swarm chunk per key, it would store a fixed number of topic discovery
records per key.

b) Add topic discovery directly to the existing UDP-based node discovery
Kademlia and implement the anti-hammering measures proposed in the
previous section.