@kocolosk
Created December 7, 2018 19:44

Heterogeneous Schema Discovery

Moving documents from Cloudant into dashDB means transforming JSON documents into relational records. To build the correct relational representation, we first have to describe the schema of the JSON document itself. That is trivial for a single document but becomes very complex when many JSON documents have to be mapped to many relational records while still fitting a limited set of tables.

The Schema Discovery Process (SDP, described in this playbook) has been designed to find a set of tables suitable to hold as many JSON documents in a Cloudant database as possible (ideally all of them). The question is how to compute this set of tables.

The answer is fairly simple if all JSON documents in a Cloudant database implement the same homogeneous schema (i.e. they have the same attributes, values of the same types, and the same levels of nested objects and arrays). The answer is more complex if the documents have heterogeneous schemas.

What is a heterogeneous schema?

There are two aspects relevant to heterogeneous schema discovery:

  1. Heterogeneous data types in the same attribute (e.g. 'age': 30 vs. 'age': 'unknown'). Here we detect the main data type based on distribution statistics and create the table column type (e.g. INT vs. VARCHAR) for the data type with the highest frequency.

All non-conforming values are rejected and end up in the _overflow table. This design addresses customer situations where 99% of the values are correct but 1% of the records use some default value that is not actually in the same domain.

  2. Heterogeneous documents in the same database (e.g. {'product': ...} vs. {'customer': ...}). Here we create multiple fact tables based on the schema buckets detected and enumerate the tables.

For a database sales we would create tables sales_1 and sales_2 and load all {'product':...} documents into sales_1 and all {'customer':...} documents into sales_2. Note that documents can deviate at any level, and the algorithm detects even nested elements with heterogeneous schemas (e.g. {'sales':{'product':...}} vs. {'sales':{'customer':...}}).
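The first aspect, majority-type detection, can be sketched as follows. This is an illustrative outline only, not the actual SDP code; the type mapping and function name are hypothetical:

```python
from collections import Counter

# Hypothetical mapping from JSON value types to column types (illustrative).
SQL_TYPES = {int: "INT", float: "DOUBLE", str: "VARCHAR", bool: "BOOLEAN"}

def detect_column_type(values):
    """Pick the column type for the most frequent value type in a sample.

    Values of any other type are routed to the _overflow table.
    """
    counts = Counter(type(v) for v in values if v is not None)
    main_type = counts.most_common(1)[0][0]
    conforming = [v for v in values if type(v) is main_type]
    overflow = [v for v in values if v is not None and type(v) is not main_type]
    return SQL_TYPES[main_type], conforming, overflow

# 99% of 'age' values are ints, so the column becomes INT and
# the string 'unknown' is rejected into _overflow.
col_type, ok, rejected = detect_column_type([30, 41, 25, "unknown"])
```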

How do I use it?

This advanced SDP implementation has not been documented for the general customer, and the UI does not expose it; it is meant for advanced customers only. We can either forward this document to the customer directly or enable the algorithm for them by following the instructions below.

Here is how to enable it. The _warehouser document described in this playbook supports a new set of attributes that allow a user to alter the behavior of the JSON schema discovery algorithm:

  • Name of the function to be used for the schema discovery algorithm

"sdp_function" with supported values "all_union" (default) and "cluster_union" (NEW)

  • A threshold value to tune the behavior of "cluster_union" (does not apply to "all_union")

"sdp_cluster_threshold" with supported double values between 0.0 < value < 1.0

sdp_function

"all_union"

The default implementation is "all_union", where a sample of all documents in a database is analyzed. If any two documents do not produce an identical schema, the function creates a union of both schemas. It repeats this process until all schemas in the sample have been analyzed. As a result it creates a single root table with many (potentially sparsely populated) columns.
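In essence, "all_union" folds every sampled document's attribute set into one union schema. A minimal sketch of that idea (illustrative only, not the actual SDP implementation; it considers top-level attribute names only):

```python
def all_union(sample_docs):
    """Union the attribute names of all sampled documents into one
    (potentially sparse) root-table schema."""
    schema = set()
    for doc in sample_docs:
        schema |= set(doc.keys())
    return sorted(schema)

# Two disjoint documents still end up in a single root table:
docs = [{"_id": "1", "product": "x"}, {"_id": "2", "customer": "y"}]
columns = all_union(docs)
```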

While this is easy and convenient for most customers, it has a couple of implications:

  1. The resulting root table may exceed the 1024-column limit and cannot be created
  2. The resulting root table may exceed the 32K total row length limit when it contains VARCHAR columns
  3. Completely disjoint data is loaded into the same table (e.g. customer and product documents)

"cluster_union"

A second implementation is "cluster_union", where the analyzed sample in the database can produce two or more distinct schemas. As soon as the similarity between any two schemas falls below the "sdp_cluster_threshold" value, they are separated into two disjoint clusters. All documents are mapped to one of the created clusters according to their schema. As a result we get multiple root tables with only the relevant columns per table.
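A minimal greedy sketch of this clustering idea, using the Jaccard similarity defined in the sdp_cluster_threshold section below. This is illustrative only, not the actual SDP implementation, and it considers top-level attribute names only:

```python
def jaccard(a, b):
    """Similarity = (# of common attributes) / (# of all attributes)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_union(sample_docs, threshold=0.2):
    """Assign each document to the first cluster whose union schema is
    similar enough (>= threshold); otherwise start a new cluster."""
    clusters = []  # each cluster is a growing set of attribute names
    for doc in sample_docs:
        keys = set(doc.keys())
        for cluster in clusters:
            if jaccard(keys, cluster) >= threshold:
                cluster |= keys  # merge: take the union of both schemas
                break
        else:
            clusters.append(set(keys))
    return clusters

docs = [
    {"_id": "1", "_rev": "a", "product": "x"},
    {"_id": "2", "_rev": "b", "customer": "y"},
]
# similarity is 2/4 = 0.5 (common: _id, _rev), so a threshold of 0.2
# keeps both documents in one cluster, while 0.6 splits them in two.
```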

This function works well but has not been made the default because:

  1. More than one set of analytical functions has to be created in dashDB for more than one fact table (e.g. a customer analysis and a product analysis)
  2. It is not a priori clear how to tune the "sdp_cluster_threshold" value, as this depends solely on the data itself

sdp_cluster_threshold

Note: The value has to be a double; an int is not supported. The boundary values 0.0 and 1.0 are not valid either.

The rationale for calculating the right "sdp_cluster_threshold" value is as follows:

  • There are X attributes (including _id, _rev) in the first document
  • There are Y attributes (including _id, _rev) in the second document
  • The similarity is calculated as: (# of common attributes) / (# of all attributes). This is the Jaccard similarity.

If we want to separate two documents into different clusters, we have to set the threshold greater than the similarity between these two documents. As a consequence we get:

  • If (# of common attributes) == 0 we have similarity=0 and any "sdp_cluster_threshold" value yields 2 clusters

  • If (# of common attributes) == (# of all attributes) we have similarity=1 and any "sdp_cluster_threshold" value yields 1 cluster

The default value is set to 0.2 and produces relatively few clusters. This helps us detect mostly disjoint schemas where at most 2 of 10 attributes are identical. If customers have documents with many common attributes (maybe system-provided) but subtle differences in a few attributes (maybe user-provided), it may help to increase the "sdp_cluster_threshold" value. Note that the similarity only considers attribute names and types, not the values themselves.
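Plugging illustrative numbers into the formula above (the attribute counts are hypothetical, chosen only to show the arithmetic):

```python
# Suppose doc1 has 4 attributes and doc2 has 5, sharing only _id and _rev.
common = 2              # _id and _rev
total = 4 + 5 - common  # attributes appearing in either document = 7
similarity = common / total  # 2/7, roughly 0.286

# With the default sdp_cluster_threshold of 0.2, similarity (~0.286) is
# above the threshold, so both documents land in the same cluster;
# raising the threshold to 0.3 would split them into two fact tables.
```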

Change process for sdp_function and sdp_cluster_threshold values

Process for operations with admin account access

Follow these steps to change a value in a customer document without communicating with the customer. Manipulate the document at https://<cloudant-user>.cloudant.com/_warehouser/<_id> using acurl.

  1. Make sure to stop the running load

    • overwrite the "warehouser_state" : "finished" with "warehouser_state" : "STOP"
  2. Verify the request has been stopped

    • inspect the "warehouser_state" : "stopped"
    • if the "warehouser_state" never changes to "stopped" and remains in any other state, go ahead and remove the "warehouser_state" attribute manually
  3. Remove the following two objects from the _warehouser document entirely

    • the discovery status in "discovery_status": {...}
    • the replication status in "replication_status": {...}
  4. Change the "sdp_function" and "sdp_cluster_threshold" values as needed

    • "sdp_function": "cluster_union" OR "sdp_function": "all_union"
    • "sdp_cluster_threshold": 0.8 OR any double value > 0.0 and < 1.0
  5. Restart the execution

    • knife data bag show sdp adm to supply <adm-user> and <adm-password> below
    • acurl https://<cloudant-user>.cloudant.com/_warehouser/<_id> to supply <dynamite_user>
    • create a file warehouse.json with the following content:

      { "warehouse": "<dynamite_user>", "databases": ["<dynamite_user>"], "bluemix_username": "user@us.ibm.com", "bluemix_password": "password" }

    • curl -k -u <adm-user>:<adm-password> -X PUT -d @warehouse.json -H Content-Type:application/json -H X-Cloudant-User:<cloudant-user> https://<cloudant-user>.cloudant.com/_api/v2/partners/dynamite/warehouse

Process for customers

Communicate these steps directly to the customer:

  1. Make sure to stop the running load with the stop button in the dashboard

  2. Verify the request has been stopped and the dashboard status shows as stopped

  3. Remove two objects from the document you find in your database called _warehouser

    • the discovery status in "discovery_status": {...}
    • the replication status in "replication_status": {...}
  4. Change the "sdp_function" and "sdp_cluster_threshold" values as needed

    • "sdp_function": "cluster_union" OR "sdp_function": "all_union"
    • "sdp_cluster_threshold": 0.8 OR any double value > 0.0 and < 1.0
  5. Repeat the execution with the rescan action in the dashboard

Known Problems

In situations where both 1) heterogeneous data types and 2) heterogeneous documents appear in the same database, it is possible to create tables with no records. An additional table could get created based on the heterogeneous documents, but its records are actually loaded into the _overflow table because of a non-conforming data type. SDP still creates those tables in case the non-conforming data type eventually becomes conforming.

For example, a sales figure is unknown for a quarter and instead of a numeric {'q1':0} it uses a non-numeric {'q1':null} or {'q1':'unknown'} value. This document would be rejected and a record of it created in the _overflow table. Later, when the actual sales figure becomes available, we have a target table to process the numeric value into. Using the _changes feed makes it necessary to create tables even if they remain initially empty.
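The overflow routing described above can be sketched like this. The function and table names are hypothetical; this only illustrates the decision, not the actual loader:

```python
def route(doc, column_types):
    """Decide whether a document's record goes to the fact table or to
    _overflow, given the column types discovered by SDP."""
    for attr, expected in column_types.items():
        # A present but non-conforming value (including null) is rejected.
        if attr in doc and not isinstance(doc[attr], expected):
            return "_overflow"
    return "fact_table"

# The q1 column was discovered as numeric, so only numeric values conform:
route({"q1": 0}, {"q1": int})          # conforming -> fact table
route({"q1": "unknown"}, {"q1": int})  # non-conforming -> _overflow
route({"q1": None}, {"q1": int})       # null is non-conforming too
```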
