
Knowledge Space

— Draft —

Link for sharing: https://w3id.org/knowledge-space/

Vision

You might sometimes ask yourself questions like this:

Does tea consumption improve your memory? What's the evidence and the scientific consensus on this?

The answer is probably out there on the web, but the different pieces of information are not structured or aligned, so we cannot automatically gather, interpret, and aggregate them in a reliable and precise way. That's why current tools like ChatGPT are essentially just guessing and give you results that are incomplete, disconnected, and often wrong.

The knowledge space is a vision to change that. It is an ecosystem that allows for sharing knowledge in a radically more efficient and more effective way.

To get a quick idea of how this will work, these rough comparisons might help:

  • The knowledge space is like the Semantic Web, but robust, scalable, and trust-aware via redundancy and cryptography
  • The knowledge space is like a knowledge graph, but open, decentralized, and collaborative
  • The knowledge space is like a blockchain, but without a chain and with logic statements instead of transactions
  • The knowledge space is like a container environment (such as Docker), but for knowledge instead of software
  • The knowledge space is not like a Large Language Model (LLM) at all, but could be the basis for something like a "Large Knowledge Model" in the future

The vision of the knowledge space is described in more detail below in the form of general principles, process sketches, concrete examples, further discussion, and pointers to the existing partial implementations.

General Principles

Knowledge space

  • The knowledge space is an envisaged open and decentralized global socio-technical ecosystem to share human knowledge
  • In the knowledge space, everything is expressed and communicated in formal logic, in a manner that humans can understand and computers can interpret, using a universal and extensible vocabulary

Knowledge records

  • Statements in the knowledge space are expressed and communicated in small knowledge records, making each record individually reusable and referenceable
  • Each knowledge record includes relevant metadata, including information about the creator of the knowledge record and the source of the statement
  • Each knowledge record is immutable and represented by a unique and cryptographically strong content-based identifier
  • Knowledge records can specify one or several record types they belong to, by which they can later be grouped

Superseding and collections

  • A knowledge record can include a declaration that it retracts or supersedes another knowledge record, thereby allowing for the representation of updates
  • A knowledge record collection is expressed by linking a collection identifier to the identifiers of the knowledge records it includes, and by representing these links in knowledge records themselves

Knowledge agents

  • Users of the knowledge space are called knowledge agents and include people as well as automated agents
  • Knowledge agents connect to the knowledge space via the use of client software
  • The client software ideally runs locally on the computer of the knowledge agent
  • Knowledge agents can also be offered access to online client software that runs on third-party servers (at the cost of a lesser degree of decentralization)

Cryptographic signatures

  • A knowledge record may include a cryptographic signature by the knowledge agent that created it, covering the rest of the knowledge record's content
  • Knowledge records with a signature also include the public key needed to verify the signature
  • A knowledge record contains at most one signature, but additional signatures can be declared in separate knowledge records pointing to the knowledge record to be signed

Knowledge services

  • Knowledge agents can interact with the knowledge space via different kinds of online knowledge services
  • To qualify as such, a knowledge service needs to have several independent sister services with equivalent functionality
  • A publishing/lookup service is a kind of knowledge service that allows knowledge agents to permanently publish knowledge records, that returns the full content of a knowledge record for a provided record identifier, and that provides incremental access to the latest knowledge records added
  • To allow for efficient synchronization, a publishing/lookup service provides incremental access to the latest knowledge records added overall, as well as per public key and per record type
  • To qualify as such, a publishing/lookup service needs to provide its lookup feature freely to any knowledge agent
  • A knowledge record is considered published once it is available via several independent publishing/lookup services in a setting that facilitates further replication by others
  • A query service is a kind of knowledge service that returns data, possibly aggregated, by executing a specified query for the provided input
  • To qualify as such, a query service may only use the published knowledge records and publicly observable behavior of knowledge services as input data to run its query
  • Query services can support queries that target the main statement of knowledge records (e.g. that a certain relation is expressed or a certain entity mentioned), their metadata (e.g. that there is a valid signature or that it was created by a certain knowledge agent), their context (e.g. that it was not retracted or that it was positively assessed by somebody), as well as the behavior of knowledge services (e.g. their responsiveness or execution speed)
  • The results returned by query services can have the form of entire knowledge records, knowledge record identifiers, identifiers of things mentioned in knowledge records, aggregated values, or any other structures derived from the published knowledge records or the behavior of knowledge services
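
For illustration, the two kinds of knowledge services could be sketched as interfaces like this (hypothetical Python; all names are illustrative and not part of any existing implementation):

from typing import Iterable, Optional, Protocol

class PublishingLookupService(Protocol):
    def publish(self, record_content: str) -> None:
        """Permanently store a knowledge record."""
    def lookup(self, record_id: str) -> str:
        """Return the full record content for a content-based identifier."""
    def latest(self, public_key: Optional[str] = None,
               record_type: Optional[str] = None) -> Iterable[str]:
        """Incrementally list the newest records, overall or per key/type."""

class QueryService(Protocol):
    def query(self, query_input: str) -> Iterable[str]:
        """Run the service's fixed, published query over the given input."""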

Introduction records

  • Relevant entities, including knowledge agents and knowledge services, are introduced to the knowledge space by publishing knowledge records that describe them, which are called introduction records
  • An introduction record of a knowledge service includes the nature of the service, the kind of knowledge records it covers, and the conditions under which knowledge agents may use it
  • The specification of the kind of knowledge records a knowledge service covers can be done by declaring record types and/or public keys of signers, directly or through a knowledge setting
  • An introduction record of a query service includes or refers to a full specification of its query
  • An introduction record for a knowledge agent includes the public keys the knowledge agent uses or used to sign knowledge records
  • The public keys linked to a knowledge agent via its introduction record can be of three types: main key, secondary key, and obsolete key
  • The main key of a knowledge agent is the public key with which the knowledge agent signs its introduction record (and possibly other knowledge records)
  • A secondary key is any public key apart from the main key that the knowledge agent uses to sign some of its knowledge records
  • An obsolete key is any public key that the knowledge agent used in the past but that has since been compromised
  • Knowledge records signed with an obsolete key are only considered valid if they have additionally been signed with a main or secondary key by the respective knowledge agent via a separate knowledge record

Assessments

  • Knowledge agents can provide assessments of entities by expressing a qualified link (e.g. a link representing approval) to an introduction record of the assessed entity, and by publishing this link as a knowledge record

Knowledge settings

  • A knowledge setting, described in an introduction record, serves as a starting point for establishing trust by providing references to other trusted entities
  • A knowledge setting refers to collections of introduction records of trusted knowledge agents and knowledge services, specifies a trust range algorithm, declares an update strategy, and directly includes the minimal information needed to access several publishing/lookup services for bootstrapping purposes
  • A trust range algorithm specifies which knowledge agents and services, and ultimately which knowledge records, should be considered trustworthy, given a knowledge setting and the published knowledge records
  • An update strategy specifies when and how a knowledge setting can be automatically replaced with a newer version

Process Sketches

Finding trusted knowledge agents and services

  • In order to connect to the knowledge space, a knowledge agent's client software needs a local copy of a trusted knowledge setting as a starting point
  • Via the client software, the knowledge agent can then access the bootstrap publishing/lookup services that are listed in the knowledge setting to retrieve the content of the collections of trusted introduction records, thereby obtaining an initial set of trusted knowledge agents and knowledge services
  • The client can then run the trust range algorithm as specified in the knowledge setting, involving calls to the known query and publishing/lookup services, to arrive at a larger and more up-to-date set of trusted entities
  • Running the trust range algorithm typically involves querying for knowledge agents' assessments of other agents and services, calculating some sort of score based on the nature and extent of assessments each potentially trustworthy agent or service has received, establishing some sort of a score threshold to delineate the trustworthy entities, and resolving any conflicts (e.g. when two different knowledge agents claim the same identifier)

Updating or changing the knowledge setting

  • The client software can regularly check via query services whether update candidates for the knowledge setting are available, and whether the specified update strategy allows for automatically replacing the current knowledge setting with the updated one
  • Criteria for the update strategy may include that the updated version is signed by the same knowledge agent, that it has received a sufficient number of positive assessments by trusted knowledge agents, and that a certain amount of time has elapsed since the update was first seen by the client
  • The knowledge agent can at any point manually switch to a different knowledge setting, override any automatic update that has happened, or define a new knowledge setting from scratch

Retrieving knowledge records

  • Given a knowledge record identifier, publishing/lookup services can be asked to provide the corresponding content, and due to the content-based nature of the identifier, the retrieved content can be automatically checked (and another publishing/lookup service can be tried if this check fails)
  • If the knowledge record is digitally signed, this digital signature can also be automatically checked by the client, and the knowledge record can be treated as invalid if the check fails
  • In the case of knowledge record collections, which are published as knowledge records too, this process can be recursively repeated to get the complete content of entire sets of knowledge records
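
For illustration, this retrieval logic could be sketched as follows (hypothetical Python, assuming the truncated uppercase-hex identifiers from the examples below and a hypothetical parse_element_ids helper; signature checking is omitted):

import hashlib

def fetch_record(record_id, services):
    # Try publishing/lookup services in turn until one returns content
    # whose hash matches the content-based identifier.
    for service in services:
        content = service.lookup(record_id)
        if hashlib.sha256(content.encode()).hexdigest().upper().startswith(record_id):
            return content
    raise LookupError("no service returned valid content for " + record_id)

def fetch_all(record_id, services):
    # Recursively fetch a record and, if it is a collection, all its elements.
    records = {record_id: fetch_record(record_id, services)}
    for element_id in parse_element_ids(records[record_id]):  # hypothetical parser
        records.update(fetch_all(element_id, services))
    return records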

Querying knowledge

  • To query the knowledge space, a knowledge agent can use its client software to contact a trusted knowledge service by sending a query of a form that is supported by this service
  • A knowledge agent may decide to probe the correctness of received results by retrieving a sample of the respective knowledge records via publishing/lookup services, and then checking locally whether the query was correctly applied
  • A knowledge agent may decide to probe the correctness and completeness of received results by checking for discrepancies when querying other equivalent or related query services

Publishing knowledge records

  • To publish a new knowledge record, a knowledge agent can use its client software to send the content of the knowledge record to a publishing/lookup service
  • A knowledge agent may decide to probe the success of a publication request by using its client software to check whether the knowledge record is returned by several independent publishing/lookup and query services

Assessing knowledge services

  • When a knowledge agent has found evidence for or against the integrity or quality of a knowledge service, it can publish this finding as an assessment in a knowledge record so others can take it into account

Recovering from compromised private key

  • If a main or secondary key of a knowledge agent has been compromised, in the sense that a third party got access to the corresponding private key (and possibly made it unavailable to its legitimate owner), then the affected knowledge agent can recover by publishing a new introduction record that declares the compromised key obsolete and announces a new key, by publishing knowledge records that re-sign with a valid key all legitimate knowledge records previously signed with the compromised key, and by convincing other trusted agents to disapprove of the introduction record that includes the compromised key and to approve of the new one

Full Picture

[Overview diagram of the knowledge space]

Concrete Examples

Concrete but hypothetical examples are given here in a notation that is deliberately disconnected from the currently existing implementations, in order to emphasize the conceptual core of the approach. Currently existing implementations are discussed afterwards.

Examples of statements

Statements in the examples shown here are written in a predicate logic notation like PREDICATE(ARG1, ARG2, ...). We restrict ourselves here to such atomic formulas and conjunctions thereof. It is easy to see how the expressivity can be arbitrarily expanded by introducing predicates like subclass-of(A, B) or implies(C, D) and agreeing on their semantics (where the latter is aided by technology but is ultimately a social process). Predicate names and constants consist of names with namespaces of the form NAMESPACE/NAME, allowing knowledge agents to define new namespaces when needed.

Constants can also be strings like "some text here". Such strings are formally treated as plain logical constants, but they may contain further informal knowledge for human readers.

Atomic formulas conjoined by logical and are written next to each other on separate lines:

general/related-to(disease/Alzheimers, gene/APOE)
general/related-to(disease/Alzheimers, gene/PSEN1)
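
For illustration, these formulas could be represented in a program roughly like this (a hypothetical Python encoding of the notation, not an existing library):

from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    predicate: str   # e.g. "general/related-to"
    args: tuple      # constants, e.g. ("disease/Alzheimers", "gene/APOE")

statements = [
    Atom("general/related-to", ("disease/Alzheimers", "gene/APOE")),
    Atom("general/related-to", ("disease/Alzheimers", "gene/PSEN1")),
]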

Such groups of formulas can be given a label in the form of a hash calculated on their sequence of symbols:

5C2C20:
  general/related-to(disease/Alzheimers, gene/APOE)
  general/related-to(disease/Alzheimers, gene/PSEN1)

For the sake of readability of these examples, fake hashes with just six hexadecimal digits like 5C2C20 are shown here. In reality, these need to be hashes long enough to be considered secure and therefore not vulnerable to collisions. These hashes also serve as logical constants and can therefore appear in argument positions of predicates. They can also be used as namespaces, as in 1CBF40/thing.
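
For illustration, such a truncated fake hash could be derived as follows (assuming SHA-256 over the serialized formulas; the real serialization and hash function are left open here):

import hashlib

def fake_hash(serialized_formulas: str) -> str:
    # Truncated to six hex digits for readability, as in these examples;
    # real identifiers would use a full, collision-resistant digest.
    return hashlib.sha256(serialized_formulas.encode()).hexdigest()[:6].upper()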

Formula groups can be nested, and indentation is used to clarify the nesting structure:

5F3EA4:
  5C2C20:
    general/related-to(disease/Alzheimers, gene/APOE)
    general/related-to(disease/Alzheimers, gene/PSEN1)
  E88197:
    prov/creator(5C2C20, agents/john-doe)

Public keys and signatures are treated and shown similarly to hashes:

sec/has-sig-for-pubkey(5F3EA4, F2E847, 075DF5)

Here, 5F3EA4 is the formula group being signed, F2E847 is the signature, and 075DF5 is the public key.

Hashes can represent statements that contain the hash itself, and signatures can cover statements that contain the signature. This is achieved with a simple trick: an otherwise unused symbol stands in for the hash to be calculated (and a second symbol for the signature), and the respective replacement operations are performed during the calculation and verification of the hash or signature.
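
A minimal sketch of this trick for the hash case (assuming SHA-256 and a plain-text serialization; both are assumptions for illustration):

import hashlib

PLACEHOLDER = "@@HASH@@"  # otherwise unused symbol standing in for the hash

def calculate_hash(content: str) -> str:
    # The hash is calculated while the placeholder is still in place.
    return hashlib.sha256(content.encode()).hexdigest()

def finalize(content_with_placeholder: str) -> str:
    # Replace the placeholder with the calculated hash to obtain the final record.
    return content_with_placeholder.replace(PLACEHOLDER, calculate_hash(content_with_placeholder))

def verify(final_content: str, h: str) -> bool:
    # Verification reverses the replacement before re-hashing.
    return calculate_hash(final_content.replace(h, PLACEHOLDER)) == h

The signature case works the same way, with a second placeholder standing in for the signature value.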

Examples of knowledge records

This is an example of a knowledge record in the domain of diseases and genes:

66C05D:
  234F0C:
    general/related-to(disease/Alzheimers, gene/APOE)
  prov/creator(234F0C, agents/john-doe)
  general/has-date(66C05D, date/20210618-112311)

This is an example of a knowledge record collection (here and below we are showing only minimal metadata and are often omitting the signature for the sake of brevity of these examples):

E3448C:
  8BE0E8:
    collection/has-element(E3448C, 66C05D)
    collection/has-element(E3448C, 22EF2B)
    collection/has-element(E3448C, 0B268E)
  prov/creator(8BE0E8, agents/john-doe)

Here, the knowledge record identifier E3448C serves also as the identifier for the collection. To compose collections out of other collections and to break down large collections into smaller records, we can use sub-collections:

21D64B:
  569B89:
    collection/has-all-elements-of(21D64B, E3448C)
    collection/has-element(21D64B, F0EC38)
  prov/creator(569B89, agents/john-doe)

Published knowledge records cannot be deleted, but only retracted. Retraction happens by publishing another knowledge record stating that the agent retracts the previous record:

2DCBA3:
  7D10E2:
    general/retracts(agents/john-doe, 66C05D)
  prov/creator(7D10E2, agents/john-doe)

A published knowledge record can be updated by publishing a new one that declares to supersede the previous one:

8AD9DC:
  3BF4C8:
    general/related-to(disease/Alzheimers, gene/APOE2)
  prov/creator(3BF4C8, agents/john-doe)
  sec/has-sig-for-pubkey(8AD9DC, CB0CA6, 075DF5)
  general/supersedes(8AD9DC, 66C05D)

Examples of knowledge agents

Knowledge agents can use their own public/private key pair to digitally sign knowledge records:

0C9BBD:
  9977B3:
    general/related-to(disease/Alzheimers, gene/APOE)
  prov/creator(9977B3, agents/john-doe)
  sec/has-sig-for-pubkey(0C9BBD, 574E6A, 075DF5)

Before signing and publishing such knowledge records, knowledge agents should introduce themselves to the knowledge space by publishing an introduction record:

C69CD0:
  E12BBB:
    agent/is-person(agents/john-doe)
    sec/has-main-pubkey(agents/john-doe, 075DF5)
  prov/creator(E12BBB, agents/john-doe)
  general/introduces(C69CD0, agents/john-doe)
  sec/has-sig-for-pubkey(C69CD0, C4B48A, 075DF5)

Knowledge agents can also be bots, i.e. automated computational agents:

C81863:
  E71019:
    agent/is-bot(agents/sai-bot)
    agent/controls(agents/john-doe, agents/sai-bot)
    sec/has-main-pubkey(agents/sai-bot, 1B0B3C)
  prov/creator(E71019, agents/john-doe)
  general/introduces(C81863, agents/sai-bot)
  sec/has-sig-for-pubkey(C81863, B3E959, 1B0B3C)

Knowledge records can only be directly signed with one key pair, but additional signatures can be linked with separate records:

47A8C9:
  520268:
    general/signs(agents/jane-smith, 0C9BBD)
  prov/creator(520268, agents/jane-smith)
  sec/has-sig-for-pubkey(47A8C9, 5AD51E, 35DFD8)

To sign a larger number of knowledge records, they can be grouped in a collection (e.g. E3448C) and signed collectively:

D4DEE3:
  12A93A:
    general/signs-all(agents/jane-smith, E3448C)
  prov/creator(12A93A, agents/jane-smith)
  sec/has-sig-for-pubkey(D4DEE3, AE3275, 35DFD8)

Examples of knowledge services

We assume here that network addresses are represented as logical constants. Some constants, such as those representing knowledge services, can therefore be interpreted as places in the network where requests can be sent and responses received. We use the following simple notation to show a request sent to a network address together with an example response:

  REQUEST
>> ADDRESS >>
  RESPONSE

We assume here that request and response are logical constants or statements (possibly conjoined and/or nested).

This is an example of a publishing invocation of a publishing/lookup service:

  0C9BBD:
    9977B3:
      general/related-to(disease/Alzheimers, gene/APOE)
    prov/creator(9977B3, agents/john-doe)
    sec/has-sig-for-pubkey(0C9BBD, 574E6A, 075DF5)
>> service/gamma-publish >>
  status/is-published-at(0C9BBD, service/alpha-publishlookup)
  status/is-published-at(0C9BBD, service/yellow-publishlookup)
  status/is-published-at(0C9BBD, service/vanilla-publishlookup)

This is an example of a lookup invocation of a publishing/lookup service:

  0C9BBD
>> service/alpha-publishlookup >>
  0C9BBD:
    9977B3:
      general/related-to(disease/Alzheimers, gene/APOE)
    prov/creator(9977B3, agents/john-doe)
    sec/has-sig-for-pubkey(0C9BBD, 574E6A, 075DF5)

This is an example of a query service:

  general/related-to(disease/Alzheimers, var/x)
>> service/beta-query >>
  result/match-in(var/x, gene/APOE, 0C9BBD)
  result/match-in(var/x, gene/PSEN1, 329E88)

Examples of assessments

This is an assessment in the form of an approval of another knowledge record:

ECA7D7:
  8DECE4:
    general/approves-of(agents/john-doe, 95784D)
  prov/creator(8DECE4, agents/john-doe)

Approval can be seen as the simplest kind of positive assessment, but more detailed assessments are possible with more nuanced relations. A simple negative assessment can look as follows:

6D0C40:
  EEE9FF:
    general/disapproves-of(agents/jane-smith, 95784D)
  prov/creator(EEE9FF, agents/jane-smith)

Example of a knowledge setting

This is an example of a knowledge setting, A2F7B3/ks, introduced in the introduction record A2F7B3:

A2F7B3:
  346797:
    setting/has-trusted-agent-collection(A2F7B3/ks, EC02B0)
    setting/has-trusted-service-collection(A2F7B3/ks, 24385D)
    setting/has-trustrange-algorithm(A2F7B3/ks, setting/basic-tr-algorithm)
    setting/has-update-strategy(A2F7B3/ks, setting/basic-update-strategy)
    setting/has-bootstrap-service(A2F7B3/ks, service/alpha-publishlookup)
    setting/has-bootstrap-service(A2F7B3/ks, service/yellow-publishlookup)
    setting/has-bootstrap-service(A2F7B3/ks, service/vanilla-publishlookup)
  prov/creator(346797, agents/john-doe)
  general/introduces(A2F7B3, A2F7B3/ks)
  sec/has-sig-for-pubkey(A2F7B3, 9A4F21, FF1A89)

Example of a trust range algorithm

Like everything in the knowledge space, a trust range algorithm is introduced with an introduction record:

AE952A:
  FAA335:
    setting/is-tr-algorithm(setting/basic-tr-algorithm)
    setting/has-definition(setting/basic-tr-algorithm,
      "Create a set T with all the knowledge agents and services from the given initial trusted
       knowledge records. Then execute these steps:
       1. For every agent in T, query the knowledge space to find all valid knowledge records
          where it expresses approval or disapproval of knowledge services or agents, and add
          these (dis)approval relations to a new set A. Try two suitable sister services in T
          for every query, and use the union of their responses. A knowledge record is
          considered valid if it is signed with a key from local (not online) client software,
          and no retraction or superseding record signed with the same key is published.
       2. For every entity in T, if the number of received disapprovals in A exceeds approvals,
          remove it from T. For every entity not in T, if approvals exceed disapprovals by two
          or more, add it to T.
       3. For all agents in T that have a shared public key (main/secondary/obsolete), keep
          only the one with the most net-approvals in A and remove the others (if tied, remove
          them all). Similarly, remove extra entities in T that share the same identifier.
       All final elements in T are considered trusted entities. Knowledge records are
       considered trusted if and only if they are signed by a trusted knowledge agent or the
       number of trusted knowledge agents who expressed their approval exceeds the ones that
       expressed their disapproval.
      "
    )
  prov/creator(FAA335, agents/john-doe)
  general/introduces(AE952A, setting/basic-tr-algorithm)

The restriction to knowledge records signed with keys from locally running client software is to account for the risk of a data breach of online client software, where a single incident can compromise a large number of knowledge agents.

The actual algorithm is here defined in a string, so a human has to implement this in client software before this algorithm is recognized and can be used. It is easy to imagine, however, how such algorithms can be parametrized or even fully specified in logic, and therefore allow for the definition of new algorithms without the need to change the client implementation.

Example of an update strategy

Update strategies are of course published as knowledge records too:

1861EA:
  370EC5:
    setting/is-update-strategy(setting/basic-update-strategy)
    setting/has-definition(setting/basic-update-strategy,
      "This basic update strategy works as follows:
       - Check once a day whether settings have been published that claim to be updates of the
         current one. Check for each of them how many valid assessments exist by a trusted
         knowledge agent.
       - Out of the update candidates that the client software has first seen more than 2 days
         ago, if exactly one of them has at least two net-approvals then replace the current
         setting with it. 
      "
    )
  prov/creator(370EC5, agents/john-doe)
  general/introduces(1861EA, setting/basic-update-strategy)
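
A sketch of how a client might apply this strategy (hypothetical Python; net_approvals_by_trusted_agents is a stub for the assessment check described above):

from datetime import datetime, timedelta

def maybe_update_setting(current_setting, candidates, first_seen):
    # Replace the current setting only if exactly one candidate that was first
    # seen more than 2 days ago has at least two net approvals.
    eligible = [
        c for c in candidates
        if datetime.now() - first_seen[c] > timedelta(days=2)
        and net_approvals_by_trusted_agents(c) >= 2  # hypothetical stub
    ]
    return eligible[0] if len(eligible) == 1 else current_setting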

Example of a process to determine trusted entities

To automatically determine which entities are to be trusted in the knowledge space, the client software only needs a local copy of an introduction record describing a trusted knowledge setting. For this example, we take the knowledge setting in example A2F7B3 above. The client software can find three bootstrap services in it (service/alpha-publishlookup, service/yellow-publishlookup, and service/vanilla-publishlookup), which it can use to retrieve knowledge records. It can also find the identifier of a collection of trusted agents: EC02B0. The client software can now request the content of this record from one of the bootstrap services:

  EC02B0
>> service/alpha-publishlookup >>
  EC02B0:
    EF3AAF:
      collection/has-element(EC02B0, C69CD0)
      collection/has-element(EC02B0, 562EDA)
      collection/has-element(EC02B0, 32C073)
      collection/has-element(EC02B0, E711E9)
      collection/has-element(EC02B0, 4C549F)
    prov/creator(EF3AAF, agents/sue-smith)

The client software first checks whether the content matches the content-based identifier (and, if not, discards the result and retries the request with another service).

The client can now request the content of the introduction records of this collection one by one:

  C69CD0
>> service/yellow-publishlookup >>
  C69CD0:
    E12BBB:
      agent/is-person(agents/john-doe)
      sec/has-main-pubkey(agents/john-doe, 075DF5)
    prov/creator(E12BBB, agents/john-doe)
    general/introduces(C69CD0, agents/john-doe)
    sec/has-sig-for-pubkey(C69CD0, C4B48A, 075DF5)

  562EDA
>> service/vanilla-publishlookup >>
  562EDA:
    5911F8:
      agent/is-person(agents/kate-brown)
      sec/has-main-pubkey(agents/kate-brown, 3DEFFF)
    prov/creator(5911F8, agents/kate-brown)
    general/introduces(562EDA, agents/kate-brown)
    sec/has-sig-for-pubkey(562EDA, 915BE0, 3DEFFF)

...

The client then repeats the same process for the trusted service collection to get introduction records of knowledge services.

Next, the client checks the trust range algorithm, which is in our case setting/basic-tr-algorithm. We can assume here that the client recognizes this algorithm and has a module to execute it as specified in the introduction record AE952A above (if not, it would not be able to use this knowledge setting).

The client starts by creating an initial set of trusted agents and services T, which in this case could look as follows:

T = {
  agent-id-main-pubkey(agents/john-doe, 075DF5),
  agent-id-main-pubkey(agents/kate-brown, 3DEFFF),
  agent-id-main-pubkey(agents/sue-smith, 682EE1),
  agent-id-main-pubkey(agents/robin-lee, C5ABF5),
  agent-id-main-pubkey(agents/rob-jones, E51A43),
  service-id-type(service/alpha-publishlookup, servicetype/publishlookup),
  service-id-type(service/yellow-publishlookup, servicetype/publishlookup),
  service-id-type(service/vanilla-publishlookup, servicetype/publishlookup),
  service-id-type(service/lion-query, servicetype/basic-query),
  service-id-type(service/hippo-query, servicetype/basic-query),
  service-id-type(service/zebra-query, servicetype/basic-query)
}

Agents and services are here treated as tuples of their identifying information (identifier plus main public key for agents, and identifier plus type for services). Because they are stored in a set, multiple occurrences of identical identifying information are treated as a single element.

For each agent in this set, the client queries for the approvals and disapprovals it has published in order to execute step 1 of the algorithm:

  basic-query/get-approval-query(agents/john-doe, 075DF5)
>> service/lion-query >>
  general/approves-of-agent(agents/john-doe, 075DF5, agents/kate-brown, 3DEFFF)
  general/approves-of-agent(agents/john-doe, 075DF5, agents/alma-gomez, B36CC2)
  general/disapproves-of-agent(agents/john-doe, 075DF5, agents/ray-rich, 2F7EA0)
  general/approves-of-service(agents/john-doe, 075DF5, service/alpha-publishlookup, servicetype/publishlookup)
  general/approves-of-service(agents/john-doe, 075DF5, service/kudu-query, servicetype/basic-query)
  general/disapproves-of-service(agents/john-doe, 075DF5, service/vulture-query, servicetype/basic-query)
  ...

The client runs each of these queries on another service of the same type, e.g. service/hippo-query, and creates the union of their results. Finally, the results of all these queries are merged into a set A of (dis)approvals:

A = {
  general/approves-of-agent(agents/john-doe, 075DF5, agents/kate-brown, 3DEFFF),
  general/approves-of-agent(agents/kate-brown, 3DEFFF, agents/bill-taylor, 78F067),
  general/disapproves-of-agent(agents/sue-smith, 682EE1, agents/ray-rich, 2F7EA0),
  general/approves-of-service(agents/robin-lee, C5ABF5, service/hippo-query, servicetype/basic-query),
  general/disapproves-of-service(agents/rob-jones, E51A43, service/zebra-query, servicetype/basic-query),
  ...
}

Next, the client aggregates A to tally the (dis)approvals of the entities. This could look as follows:

a  d  T  entity
-------------------------------------------------------------------------
2  0  X  agent-id-main-pubkey(agents/john-doe, 075DF5)
2  1  X  agent-id-main-pubkey(agents/kate-brown, 3DEFFF)
3  0  X  agent-id-main-pubkey(agents/sue-smith, 682EE1)
3  1  X  agent-id-main-pubkey(agents/robin-lee, C5ABF5)
1  3  X  agent-id-main-pubkey(agents/rob-jones, E51A43)
2  0     agent-id-main-pubkey(agents/bill-taylor, 78F067)
1  1     agent-id-main-pubkey(agents/ray-rich, 2F7EA0)
3  1     agent-id-main-pubkey(agents/rob-jones, 871524)
2  0     agent-id-main-pubkey(agents/sue-smith, 5A7171)
1  0  X  service-id-type(service/alpha-publishlookup, servicetype/publishlookup)
0  0  X  service-id-type(service/yellow-publishlookup, servicetype/publishlookup)
1  1  X  service-id-type(service/vanilla-publishlookup, servicetype/publishlookup)
3  0  X  service-id-type(service/lion-query, servicetype/basic-query)
1  0  X  service-id-type(service/hippo-query, servicetype/basic-query)
1  2  X  service-id-type(service/zebra-query, servicetype/basic-query)
2  0     service-id-type(service/kudu-query, servicetype/basic-query)
1  1     service-id-type(service/vulture-query, servicetype/basic-query)

For step 2, the client checks whether the number of approvals a minus the number of disapprovals d is high enough for inclusion in the final set of trusted entities. The threshold differs depending on whether the entity is currently in T (as indicated in the third column). Entities in T with a negative net approval count are removed. This applies here to these two entities, which are consequently removed from T:

agent-id-main-pubkey(agents/rob-jones, E51A43)
service-id-type(service/zebra-query, servicetype/basic-query)

Entities not in T with a net approval count of at least two are added. This is the case for the following entities, which are therefore added to T:

agent-id-main-pubkey(agents/rob-jones, 871524)
agent-id-main-pubkey(agents/sue-smith, 5A7171)
service-id-type(service/kudu-query, servicetype/basic-query)

T looks now as follows, with the new entries shown at the bottom:

T = {
  agent-id-main-pubkey(agents/john-doe, 075DF5),
  agent-id-main-pubkey(agents/kate-brown, 3DEFFF),
  agent-id-main-pubkey(agents/sue-smith, 682EE1),
  agent-id-main-pubkey(agents/robin-lee, C5ABF5),
  service-id-type(service/alpha-publishlookup, servicetype/publishlookup),
  service-id-type(service/yellow-publishlookup, servicetype/publishlookup),
  service-id-type(service/vanilla-publishlookup, servicetype/publishlookup),
  service-id-type(service/lion-query, servicetype/basic-query),
  service-id-type(service/hippo-query, servicetype/basic-query),
  agent-id-main-pubkey(agents/rob-jones, 871524),
  agent-id-main-pubkey(agents/sue-smith, 5A7171),
  service-id-type(service/kudu-query, servicetype/basic-query)
}

Note that agents/rob-jones now has a new public key assigned, because the previous entry was removed and a new one added. As we will show in the example below, this can be the result of him successfully recovering from somebody else compromising his private key.

In the above version of T, two different agents claim the identifier agents/sue-smith, assigning it two different public keys. This is resolved in step 3. In such cases of identifier or public key collisions, only the entity with more net approvals is kept, which is in this case the one with public key 682EE1. This could be, for example, the result of somebody wrongfully and unsuccessfully trying to steal the established identity of this agent. Such collisions can also happen with secondary and obsolete keys, with the same consequence, but for simplicity we only show the main keys here.

After having resolved this collision, we end up with our final set of trusted entities, consisting of five agents and six services.
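
To summarize the whole procedure, here is a condensed sketch of setting/basic-tr-algorithm as executed above (hypothetical Python; fetch_assessments and identifier_of are stubs for the queries and accessors described in the text):

def basic_trust_range(initial_agents, initial_services, fetch_assessments):
    T = set(initial_agents) | set(initial_services)

    # Step 1: collect valid (dis)approvals published by the initially trusted
    # agents; fetch_assessments stands in for the sister-service queries and
    # validity checks, yielding (assessor, entity, is_approval) tuples.
    A = set()
    for agent in initial_agents:
        A |= fetch_assessments(agent)

    def net(entity):
        # Approvals minus disapprovals received by the entity in A.
        return sum(1 if ok else -1 for (_, e, ok) in A if e == entity)

    # Step 2: drop entities with negative net approval; add outside entities
    # whose approvals exceed disapprovals by two or more.
    removed = {e for e in T if net(e) < 0}
    added = {e for (_, e, _) in A if e not in T and net(e) >= 2}
    T = (T - removed) | added

    # Step 3: resolve identifier (or key) collisions, keeping only the entity
    # with the highest net approval; if tied, remove them all.
    groups = {}
    for e in T:
        groups.setdefault(identifier_of(e), []).append(e)  # hypothetical accessor
    for group in groups.values():
        if len(group) > 1:
            best = max(net(e) for e in group)
            winners = [e for e in group if net(e) == best]
            keep = set(winners) if len(winners) == 1 else set()
            T -= set(group) - keep
    return T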

Example of a process of recovering from compromised private key

As an example of the process of recovering from a compromised private key, let us have a closer look at the agent agents/rob-jones from above (let us call him Rob). This is his introduction record:

4C549F:
  53DD82:
    agent/is-person(agents/rob-jones)
    sec/has-main-pubkey(agents/rob-jones, E51A43)
    sec/has-secondary-pubkey(agents/rob-jones, 95063D)
  prov/creator(53DD82, agents/rob-jones)
  general/introduces(4C549F, agents/rob-jones)
  sec/has-sig-for-pubkey(4C549F, 4D4EB7, E51A43)

Rob declares his main key E51A43 and a secondary key 95063D. Now, let us assume the worst-case scenario of a malicious third party managing to get access to the private key of his main public key E51A43, and moreover making this private key inaccessible to Rob. The challenge is now to re-establish the identity of agents/rob-jones with non-compromised keys.

There are slightly less serious variations of this worst-case scenario, e.g. if a secondary key is affected or the private key is still accessible to its original owner, which are a bit easier to recover from, but for simplicity, we will only cover the worst-case scenario here.

To recover from this, Rob first has to create a new key as a replacement (871524) and publish a new introduction record where the compromised key is labeled as obsolete:

D3258B:
  3B813A:
    agent/is-person(agents/rob-jones)
    sec/has-main-pubkey(agents/rob-jones, 871524)
    sec/has-secondary-pubkey(agents/rob-jones, 95063D)
    sec/has-obsolete-pubkey(agents/rob-jones, E51A43)
  prov/creator(3B813A, agents/rob-jones)
  general/introduces(D3258B, agents/rob-jones)
  sec/has-sig-for-pubkey(D3258B, E83246, 871524)

Next, Rob has to find all his legitimate knowledge records previously published and signed with the obsolete key. Because the malicious attacker can also use this key and it can therefore no longer be trusted, these knowledge records have to be re-signed with one of the valid keys. Rob therefore comes up with a (potentially large) set of knowledge records to be re-signed (here FE9F55, 39912B, 6C3DC6, ...) and publishes this set as a collection:

2DFD66:
  40D532:
    collection/has-element(2DFD66, FE9F55)
    collection/has-element(2DFD66, 39912B)
    collection/has-element(2DFD66, 6C3DC6)
    ...
  prov/creator(40D532, agents/rob-jones)

This now allows Rob to re-sign all these knowledge records with his new key in a single knowledge record (he could also re-sign them individually):

BDA007:
  32D52B:
    general/signs-all(agents/rob-jones, 2DFD66)
  prov/creator(32D52B, agents/rob-jones)
  sec/has-sig-for-pubkey(BDA007, 94DE0D, 871524)

Given Rob's new introduction record, everything seems fixed, but so far other knowledge agents have no reason to trust it more than the previous one. In fact, it is likely that they will see the new record as an intruder and the old one as the legitimate one.

For Rob to fully recover his identity, he now has to use his connections to convince other trusted knowledge agents to disapprove of the old introduction record and to approve of the new one. Convincing other agents is a social process, happening for the most part outside of the knowledge space.

After a few trusted knowledge agents have been convinced, this identity clash can show up in automatically created lists of contested identities (which can be implemented as a knowledge service) and this can trigger further attention by other trusted knowledge agents, who might do some further investigation and then contribute to the approval of the new identity.

Once a sufficient number of new approvals have been accumulated (and disapprovals for the old record), trust range algorithms will start selecting the new introduction record as trustworthy and discard the old one. At this point, Rob has succeeded in re-establishing his identity, with a new public key but with the same identifier.

Discussion

Trust Range Algorithms

The trust range algorithm described above is just an example. It lets you build a reasonable level of trust, but it can be improved in various ways. For example, knowledge agents who are removed from T because they are deemed untrustworthy still have their votes counted when the algorithm decides which other entities should be deemed trustworthy. Moreover, the algorithm doesn't correct for the fact that knowledge services deemed untrustworthy and removed from T might have been used in earlier steps. Another shortcoming of this particular algorithm is that approvals from the initial trusted agents are needed for an entity to end up in the final set of trusted entities; second-degree approvals are not counted.

Despite these shortcomings, it is difficult for malicious agents to manipulate the system. If such an agent manages to get included in the initial collection of trusted entities, it gets just one vote and is easily outnumbered by the other members. As long as the non-malicious agents hold a solid majority that allows them to get two net-approvals against the malicious ones, the update strategy shown above allows them to fix any such problem and exclude malicious agents with a new update. Getting a malicious service included only helps if there are at least two of them; in that case, with a bit of luck, a client picks both of them and they can return manipulated results in concert. But this, too, can be fixed with a simple update of the knowledge setting.

It is quite easy to imagine algorithms that handle the problems above in a better way, and therefore these are not inherent problems of the knowledge space. They can be solved by defining and implementing more advanced algorithms. Different situations and different problems might require different algorithms anyway. In the knowledge space, such algorithms can co-exist and compete in an open ecosystem.

While such an algorithm can be arbitrarily complex, there is no such thing as a perfect trust range algorithm. This is because one cannot be perfectly sure about anything, ever, no matter what system one is using. It is all about levels of trust.

Knowledge Settings

A knowledge agent's knowledge setting defines what its client software will show as trustworthy and what will be shown as non-trustworthy or not shown at all. It can therefore be perceived as something similar to a filter bubble, but with two important differences. First, it is fully transparent what is happening and why certain things are shown and others are not. Second, the knowledge setting can be freely changed by the knowledge agent to explore other perspectives. A knowledge agent can even define its own knowledge setting.

Knowledge settings can have a narrow focus, for example by including respected members of a given scientific field, or they can have a broader focus, such as including respected scientists of all kinds of disciplines and other kinds of trusted public figures. These two are not in conflict, as they simply provide different perspectives that the knowledge agents can choose and switch at any moment.

Anybody can define knowledge settings but nobody is forced to use them. One can imagine that big international bodies, such as the UN or the European Commission, could publish such knowledge settings in the future, and they might provide a good default choice for knowledge agents to use.

Openness and Decentralization

The knowledge space is open for everybody to access. By their definition, the lookup features of publishing/lookup services are free to use, accessible to everybody, and available in the form of several independent and redundant instances. Query services on the other hand are not required to be free and open, as they can be arbitrarily complex and therefore also costly to run. However, by their definition, query services only work on data that has been published to the knowledge space and therefore is available via the free and open publishing/lookup services. Every query service is therefore open for competitors that fetch the same data and run the same query. Market forces can therefore make sure that all knowledge services are provided at a fair price, and we can assume that free instances will be available for the simpler kinds of queries.

Interpretations

For the knowledge space to work in the ways outlined above, there needs to be an agreement on the interpretation of some core predicates like general/supersedes, general/approves-of, and sec/has-sig-for-pubkey. Further predicates and namespaces can be introduced by the knowledge agents as needed, by publishing corresponding introduction records. These introduction records can be conflicting, however, when several agents introduce the same predicates in incompatible ways. The knowledge space does not provide a fixed definition of how this is resolved, but provides at least three techniques that allow knowledge agents to find agreement. First, newly minted predicates can use the identifier of their introduction record as namespace, e.g. 7BA3B2/has-property-x, which thereby reliably and unambiguously links to the definition stated in its introduction record. Second, as identifiers can also act as network locations, an identifier's corresponding network location can be set up in such a way that it returns the authoritative introduction record when asked to, thereby delegating the authority question to the networking layer. Third, agreement about an interpretation can be found collectively by publishing approving and disapproving assessments of the respective introduction records, as it is done by trust range algorithms for knowledge agents and knowledge services. These techniques can be applied for single predicates but also for entire namespaces.

Current Implementations

The knowledge space as a whole is still a vision, but most aspects already have partial or full implementations.

Identifiers with namespaces of the kind required by the knowledge space are implemented with IRIs/URIs. With HTTP(S), such identifiers can also serve as network locations, as required for knowledge services. The inclusion of hash values for content-based identifiers can be achieved with Trusty URIs.
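For illustration, a Trusty URI appends a hash-based suffix to an ordinary URI; a hypothetical example could look like https://example.org/np/RA5AbXdpz5DcaYXCh9l3eI9ruBosiL5XDU3rxBbBaUO70, where the trailing part starting with RA is a base64-encoded hash of the identified content, so the retrieved content can be verified against the identifier itself.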

The Resource Description Framework (RDF) is an implementation of the logic language that is used to write statements that can then be published as knowledge records.

Nanopublications can be seen as an implementation of knowledge records. Knowledge record collections are implemented with nanopublication indexes.

Nanopub Server is a partial implementation of a publishing/lookup service. The nanopub API is an implementation of a query service. Nanodash is an implementation of client software that assists knowledge agents in accessing the knowledge space.

License

This text is available under the CC BY 4.0 license. The images were created with Excalidraw, using several of its libraries of visual elements, and are available under the MIT license.

tkuhn (Author) commented Jan 11, 2023

@etzm Thanks a lot for your extensive and really valuable comments! You are asking the right questions and pointing to a number of points that indeed need further explanation (and a bit more work).

I recommend exploring some recent developments, such as IPFS and ISCC, to replace or complement some parts of the stack used.

Yes, IPFS is a technology that could be used for the publishing and lookup services as described here. In concrete terms, I tried to tie IPFS to the decentralized nanopublication sharing architecture that we are developing (see the links under Current Implementations), but there were practical issues that prevented me from using IPFS in a way that its hashes would be mappable to the ones I am using for nanopublications. IPFS doesn't hash only the plain content but also includes some sort of header (I don't remember the details), which means I cannot feed it nanopublications in such a way that it would produce a hash with a clear one-to-one correspondence to the nanopublication hash I am calculating. So, theoretically it is absolutely compatible and complementary, but there are concrete practical issues.

ISCC is very interesting, but I don't see yet how it would apply here. Since knowledge records consist fully of logic, we can do much smarter and more powerful things than "similarity" in a relatively straightforward way. In that sense it's different from the scenarios that seem to lie behind the motivation of ISCC. But I might be overlooking something here.

It would be great to explore some early adoption for specific research fields and I would be glad to help connect to initiatives I am aware of that appear to work on related concepts.

That's great to hear and is of course very welcome!

The choice of ORCID as an identifier (I checked out Nanobench) is “OK”, but in light of an open and permissive protocol, one might consider also allowing other identifiers (i.e. Keybase, Ethereum Wallet addresses).

Fully agreed. The knowledge space doesn't make any such commitment to ORCID, and the existing tools (such as Nanobench) use it only for practical reasons (to not confuse the user too much at this point).

My primary concerns are related to spam (flooding with wrong information), moderation and curation thereof, and the nature of permanent storage of the information.

With respect to "wrong information", the general standpoint of this vision is to let it happen, not getting in the way of its publication, but for it to be placed in the context of who published it, what reputation that person has, what kind of reactions the publication has received etc. Most types of spam will just be ignored, because they won't meet any criteria for it to be shown to users.

With respect to "flooding", i.e. publishing stuff not for it to be read but to overwhelm the system, that's of course an important problem. I sketch below how this can be handled.

Regarding the risk of link rot, censorship, services becoming unreachable, and going out of business, I am not entirely understanding the concept of where data will be stored and how knowledge records are being updated (given a record can receive multiple signatures over time and also be included in other records, it seems there must be somewhere just "one" point for a knowledge record to be stored).

No, there doesn't need to be one point. Knowledge records are identified by their hashes, and once they are available at a number of different independent places, they are highly unlikely to ever disappear again (unless there is an orchestrated effort across these otherwise independent entities, which could/should happen when we talk about "illegal" contributions; see below).

Trusty URIs might be at risk of link-rot?

Some Trusty URIs might not resolve to their designated entity when applying the HTTP protocol to get them. But if their entities are published in a system like the one above, you have other lookup services to try and get them. This is no longer standard HTTP, but could be another standard layer on top. In a sense Trusty URIs are URLs just for convenience, so you can use good old HTTP to get them (in most cases), but the framework described above gives you several options that are highly unlikely to all fail at once.

How is the knowledge space in which I may publish being defined?
I mean, for example, by research field? Who defines that? This will be very important for discovery.

In my terminology, "knowledge space" is the whole global thing. In that knowledge space, you can use knowledge settings to define subsets of agents, services, and records that are deemed trustworthy and to a certain level authoritative. So, you could define a knowledge setting that defines who is in the community of your research field, and thereby look at the knowledge space through that lens, so only seeing things that are in that subset.

What if I want to be spamming a given space? More general: How is moderation taking place in the knowledge records concept?

If you want your contributions to be seen by others (or even to publish them; see below), you need to be included (directly or indirectly) in the agent list of knowledge settings that others use. For that, you initially need somebody who is already in that list to add you (or several people, depending on that setting's rules), and if you misbehave you are likely to be excluded again. More fine-grained moderation can also happen on the record level, but this is not about deciding whether something can be published or not, just about whether something should be shown or not.

There may be an incentive to create a large number of knowledge records. It could become a race to the bottom if # of knowledge records is some form of a metric used for scientific incentivization.

Maybe, for a while, but then people can just adjust their metrics. The knowledge space doesn't enforce any metric, but allows different ones to be used side-by-side and for new ones to be defined and applied.

Are the knowledge services acting here as gatekeepers?

Not in terms of quality (see above), but to an extent with respect to volume. This is something that needs an additional discussion point in the document.

Somebody could sabotage the system by publishing vast numbers of records, or just in good intention flood the system with vast numbers of records until it breaks.

To account for this, all publishing services and query services should specify a knowledge setting (or several of them) and give all agents included by it a quota of records. On top of that, these services can allow users to pay for an increased quota. That means that to get something published and found by query services, you need to be included in the respective knowledge settings (or set up your own server to join the network), and then you have a limited number of records you can publish.

(Agents should be able to publish their records in several "layers" with different keys used for each layer, so if an agent exceeds the quota of a particular query service, that query service can still use some of the "smaller" layers, where things like introductions and approvals should be kept. I need to write all this up more properly...)

Forging timestamps: Could I publish a record with the wrong time-stamp?

Yes, you can publish a record with a timestamp in the past. You can also create a record now and publish it in a few years. With respect to timestamps in the future, the publishing services should actually complain and not allow for this, and incidents of services or agents seen publishing something with a timestamp in the future could be reported (as a knowledge record, of course) and this could be taken into account by trust assessments.

Collections could be extended at different times with different newer and older but later discovered knowledge records as collections grow.

Yes!

Is such a search implementable in an efficient way?

I don't think there is anything here that would make search of the kind we are used to (on the Web etc.) inefficient. We can of course do much more powerful kinds of searches, and not all of them will be efficient, but that's obviously not a downside.

Publishing knowledge records

In this step, one could explore relying on concepts from IPFS. Still, ideally, the knowledge records would be agnostic to a specific medium of storage, and they would be available through multiple, different forms (including library servers, publishers, S3, etc.).

Yes, this is perfectly compatible with IPFS as far as I see (conceptually, despite the practical problems I mentioned above), and the knowledge space doesn't assume that all the services use the same technology.

Have you considered the possibility of a record whose content needs to be deleted? Raising the question again of where content is stored.

Yes. With respect to "deleted", this can mean two things in the knowledge space: (1) Not visible unless you are very specifically looking for exactly the given record. This can be achieved with retractions as discussed above (possibly with further levels/nuances). It can of course also mean (2) no longer available at all, which might be need if showing the content is illegal, for example. In the latter case, all the lookup and query services would have to be convinced to delete the respective record. Technically, they are not required to, but they might be legally, and this is perfectly possible. (At some point one would of course want to make this some sort of legal framework to streamline this.) But even in case (2), one could imagine that some entities (e.g. governmental entities) keep these records in some sort of secured environment, where they only give very narrow groups of people access on request (think of journalists, researchers, judges, ...).

I hope that makes sense. Many of the things above need some proper write-up and inclusion in the main document. As I said, you asked the right questions, so this gave me an opportunity to sketch some of the answers here that I wanted to give in the main document anyway :)

@blocknomad commented

Nice work! I'd love to see the Knowledge Space integrated with DeSci Nodes' implementation of Research Objects https://desci.com/nodes

tkuhn (Author) commented Jun 2, 2023

Yes, interesting! I have just been checking it out.

To use it as a publishing service, the question is whether it's set up to deal with large numbers of small files, like millions and more. To store knowledge records, each needs to be identifiable and downloadable individually.

Interesting in any case.
