@bryaakov
Last active March 13, 2018 14:26
Flatten JSON Design

Flatten JSON

// TODO: Rename feature. Suggestion: Dynamic Nested Indexing

Abstract

  • CM-Well will allow indexing not only by Linked Data predicates, or plain field names (AKA meta/nn), but also by nested JSON attributes.
  • Those attributes will be supplied in a JSON format (see APIs below), and will be searchable using a "path" in the JSON structure (see Examples below).

APIs

  1. Indexing an Infoton with nested attributes: Upload a FileInfoton to the path of your document, with "application/x-cm-well-json" content type header.
  2. Searching by nested attributes: "...&qp=[NestedPath].json[FieldOperator][value]"
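The query-parameter shape in item 2 can be sketched as a small helper. This is only an illustration of the string format described above; the function name `nested_qp` is hypothetical and not part of CM-Well.

```python
def nested_qp(path, operator, value):
    """Build a qp clause shaped like [NestedPath].json[FieldOperator][value].

    `nested_qp` is a hypothetical helper name used for illustration only.
    """
    # JSON booleans are lowercase, unlike Python's str(True) == "True".
    if isinstance(value, bool):
        text = "true" if value else "false"
    else:
        text = str(value)
    return f"{path}.json{operator}{text}"


print(nested_qp("pattern.stripes.black", ">", 5))
print(nested_qp("pattern.tail", "::", True))
```

The two calls produce the same clauses that appear in the search examples of this document.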

Example

Assuming "cm-well" is a cluster name.

Data ingest:

$ curl -X POST cm-well/_in?format=ntriples -H "X-CM-WELL-TOKEN:<WriterToken>" --data-binary '<http://example.org/zebra1> <http://example.org/zebra-ns/name> "Zebra" .'
{"success":true}

Dynamically indexing:

$ curl -X POST cm-well/example.org/zebra1 -H "X-CM-WELL-TYPE:File" -H "Content-Type:application/x-cm-well-json" -H "X-CM-WELL-TOKEN:<WriterToken>" --data-binary '{
  "pattern": {
    "stripes": {
      "black": [1,3,5,7],
      "white": [0,2,4,6]
    },
    "tail": true
  }
}'
{"success":true}
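
To make the notion of a "path" concrete, here is a minimal sketch that lists the dotted paths contained in the document uploaded above. It is not the FTS implementation. It assumes that array elements are addressed by the path of the array itself (no positional index), which is an assumption rather than something this design states. The name `nested_paths` is hypothetical.

```python
import json


def nested_paths(node, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for every leaf of a parsed JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from nested_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # Assumption: every array element shares the path of the array itself.
        for child in node:
            yield from nested_paths(child, prefix)
    else:
        yield prefix, node


doc = json.loads(
    '{"pattern": {"stripes": {"black": [1,3,5,7], "white": [0,2,4,6]}, "tail": true}}'
)

# Group leaf values by their dotted path.
paths = {}
for path, value in nested_paths(doc):
    paths.setdefault(path, []).append(value)

for path, values in paths.items():
    print(path, values)
```

Under that assumption the document exposes three searchable paths: pattern.stripes.black, pattern.stripes.white and pattern.tail.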

Searching by dynamic nested fields:

$ curl "cm-well/example.org?op=stream&recursive&qp=pattern.stripes.black.json>5&format=ntriples"
<http://example.org/zebra1> <http://example.org/zebra-ns/name> "Zebra" .
$
$ curl "cm-well/example.org?op=stream&recursive&qp=pattern.tail.json::true&format=ntriples"
<http://example.org/zebra1> <http://example.org/zebra-ns/name> "Zebra" .

Implementation

  • All searchable APIs (search, stream, consume, etc.) will support the ".json" virtual namespace. If it is supplied, a new FieldNameParser will be used, passing the "first name" (i.e. the JSON path) as-is to FTS (FTS will also support such a method).
  • For uploading a JSON FileInfoton: if the content type is application/x-cm-well-json rather than application/json, FTS will index the JSON as-is.
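
The two decisions above can be sketched as follows. This is a rough illustration, not the actual FieldNameParser or FTS code; `parse_field_name`, `NestedField` and `should_index_as_is` are hypothetical names, and treating content-type parameters (such as a charset) as ignorable is an assumption.

```python
from typing import NamedTuple, Optional

JSON_SUFFIX = ".json"
NESTED_JSON_MIME = "application/x-cm-well-json"


class NestedField(NamedTuple):
    path: str  # dotted JSON path, handed on unchanged


def parse_field_name(name: str) -> Optional[NestedField]:
    """Return a NestedField for names in the ".json" virtual namespace, else None."""
    if name.endswith(JSON_SUFFIX) and len(name) > len(JSON_SUFFIX):
        # Strip only the virtual-namespace suffix; the path itself is kept as-is.
        return NestedField(path=name[: -len(JSON_SUFFIX)])
    return None


def should_index_as_is(content_type: str) -> bool:
    """True only for the dedicated mimetype; plain application/json is not indexed as-is."""
    # Assumption: parameters after ';' (e.g. charset) do not affect the decision.
    return content_type.split(";")[0].strip().lower() == NESTED_JSON_MIME


print(parse_field_name("pattern.stripes.black.json"))
print(should_index_as_is("application/json"))
```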

Documentation

  • We should make it crystal clear that using this feature assumes the schema is bounded; otherwise Elasticsearch will blow up.
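
One way to illustrate what "bounded" means here is to count the distinct nested paths a set of documents would introduce and compare the count with a budget. The sketch below is a client-side check under stated assumptions, not part of the proposed feature; `exceeds_field_budget` and the `max_fields` threshold are hypothetical.

```python
def distinct_paths(node, prefix=""):
    """Return the set of dotted leaf paths in a parsed JSON value."""
    if isinstance(node, dict):
        found = set()
        for key, child in node.items():
            found |= distinct_paths(child, f"{prefix}.{key}" if prefix else key)
        return found
    if isinstance(node, list):
        found = set()
        for child in node:
            # Assumption: array elements share the path of the array itself.
            found |= distinct_paths(child, prefix)
        return found
    return {prefix}


def exceeds_field_budget(documents, max_fields):
    """True once the documents together introduce more than max_fields distinct paths."""
    seen = set()
    for doc in documents:
        seen |= distinct_paths(doc)
        if len(seen) > max_fields:
            return True
    return False


# Same shape in every document: the set of paths stays fixed.
bounded = [{"pattern": {"tail": True}}, {"pattern": {"tail": False}}]
# Data used as keys: every document adds a new path.
unbounded = [{"by_id": {f"zebra{i}": i}} for i in range(50)]

print(exceeds_field_budget(bounded, max_fields=10))
print(exceeds_field_budget(unbounded, max_fields=10))
```

The second data set shows the pattern to warn about: when values such as identifiers are used as JSON keys, every new document adds a new field.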
bryaakov commented Mar 13, 2018

Design Review:

  • Name: Statistics on nested JSON
  • Mimetype: application/json OR application/x-compound-json-object ? TBD
  • Mangling needed? What if a user ingests more than one type per field?
  • Documentation won't help. Perhaps a "Content Stats Dashboard" will do the work.
  • Add an op=aggr example // also - rename the feature to stats (API CHANGE?)
