Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Brainstorming the different aspects of Indexed Structured Data. Or something like that. A glorious mess, for now.

There are many open questions about a possible implementation of this idea.

Possible serialization formats

Data types

  • numerical data, to allow for range queries (e.g. size, date/time, ...)
  • hashes to be exactly matched for categorical data (references)
    • only hashes, but stuff that isn't already hash could be just quitely hashed
  • semantic hashes, to allow for similarity search
    • one-fits-all size (e.g. 256 bits) v.s. different sizes for different attributes?
    • 256 bits allow for 256 different categories (though more than one bits can be assigned to them to express fuzzier membership)
    • the categories can be a result of manual feature engineering (e.g. library classification), but machine learning can also be used (check out Salakhutdinov and Hinton's paper)

How to select what to index?

There are different ways to go:

  • we have separate data types to mark stuff to be indexed: "indexed hash", "semantic hash", "indexed double"
  • documents have a meta section that tells what needs to be indexed
  • each document belongs to a schema (stored as immutable data), and the schema defines what needs to be indexed
  • to-be-indexed parts have a specific structure, e.g. title: { index: true, type: "book title", value: "The Lord of the Rings"}
    • JSON-LD is a serialization format for RDF, and it does something like this

Exact v.s. proximity queries

Either...

  • we only have one hash type so queries apply to all hashes, always
  • we differentiate between two types of hashes:
    • category / reference: can only be looked up exactly
    • semantic hash: we allow (the more expensive) proximity queries as well

Should attributes be also part of the index?

i.e. (title, "The Lord of the Rings") v.s. just "The Lord of the Rings"

  • no, they aren't necessary, because attributes can be emulated by hashing the combination of the attribute and the value: SHA256(sprintf("%s:%s",attribute,value)) (it doesn't allow for inexact semantic queries restricted to a given attribute)
  • yes, but attributes are just hashed text (e.g. SHA256("book title"))
  • an attribute is a hash that references another document that stores its definition (the previous one can be thought of as a special case of this one)

Access control

A schema definition could specify that only its owners are allowed to post Indexed Structured Data documents for this schema, so no PUTs from others would be accepted. This may lower the possibility of resource abuse.

What should an index contain?

  • just a key => document mapping
  • also the owner of the containing document
  • also the schema of the document (if we have schemas, that is)

How much of the key space should vaults be responsible for?

Index entries are tiny compared to data blocks, so vaults can store a lot of them (500k using 64MB RAM; yes, you would need to keep it in RAM, or else.) So, storage is not really an issue, and we're already used to having to pay for indexing overhead in databases; something for something.

However, proximity searches in XOR space are expensive: indexing doesn't help much, and in the end you'll need to literally count the bits to see how close each entry is. A nice thing though is that vaults will only receive queries that are already close to the entries they store, so other vaults' resources aren't wasted.

All this does not apply to exact queries; they are fast and simple.

Should search queries be free?

Maybe just the regular throttling is enough.

Should indexing impose additional cost on PUTs?

There is clearly a difference between the resource use of an SD block that has zero indexed fields and one that has a thousand semantic hashes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.